What are the three main types of LM application archetypes discussed?

The archetypes are: Chatbot Plus (interactive, like ChatGPT), Background Agents (automated tasks, don't require immediate human response), and Data Processors (extracting structure from unstructured data for storage or later querying).

How are workloads defined for inference systems?

Workloads are defined by metrics and constraints like queries per second (QPS), expected input/output tokens, prefix reuse, and latency budgets (time to first token and time per output token).

What's the difference between efficiency-bound and capability-bound models?

Efficiency-bound means satisfactory intelligence where cost is the primary driver, often using open-source models. Capability-bound means the highest existing intelligence is insufficient, typically requiring very large, proprietary models.

What are the main open-source inference engines available?

The primary engines discussed are TensorRT-LLM (from NVIDIA, best for small models/batches), vLLM (wide adoption, enterprise flavor), and SGLang (performance-focused, startup culture).

Why are recent NVIDIA data center GPUs like H100/B200 preferred for inference today?

These GPUs feature High Bandwidth Memory (HBM) crucial for decode performance, offer better power delivery and cooling via the SXM form factor, and have advanced interconnects like NVLink for scaling.

What are the challenges with deploying inference systems at scale?

Challenges include the scarcity and cost of GPUs, hardware failures (GPUs failing in weeks/days), and significant variability in traffic demand, requiring dynamic provisioning and fast replica startup.

How can developers ensure fast replica startup for inference servers?

Strategies include operating a buffer of idle machines, lazily loading file systems while eagerly loading essential components like PyTorch, and using checkpoint restore technologies to avoid lengthy initialization times.

What are common types of bugs found in inference systems?

Bugs fall into three categories: application-level (shared responsibility, tricky), model quality (train-serve skew, often due to tokenizer issues), and performance bugs (regressions or cross-replica differences).

What are the key metrics to log for inference system observability?

Essential metrics include time to first token, time per output token, time to last token, QPS, workload split (prefill/decode), cached prefill rate, and hardware metrics like temperature and power draw.

What are the most impactful performance optimization techniques for inference?

The biggest levers are speculative decoding (significantly speeding up decode by using a draft model) and quantization (reducing model size and memory bandwidth by using lower precision formats like FP8 or FP4).

What is speculative decoding and why is it effective?

Speculative decoding uses a smaller, faster 'draft' model to predict several tokens ahead, which are then verified by the main model. This significantly speeds up inference, especially when memory bandwidth is the bottleneck, offering speedups of 2x to 8x or more.

Stanford CS25: Transformers United V6 I Serving Transformers: Lessons from the Trenches

Stanford Online

Education | 6 min read | 83 min video

Jun 4, 2026 | 952 views

TL;DR

Serving transformer models at scale is a rapidly evolving field with new engines and hardware, but efficient inference requires careful attention to application needs, workload definition, and debugging.

Key Insights

Inference, often seen as the 'boring infra' of AI, is crucial for generating revenue and attracting resources, unlike training which is a cost center.

The choice between 'efficiency-bound' and 'capability-bound' models dictates whether cost or raw intelligence is the primary driver for inference engineering.

Hardware limitations, particularly memory bandwidth during the decode phase, necessitate specialized data center GPUs (SXM form factor) for efficient inference.

Provisioning hardware for peak inference demand can leave utilization as low as 30-40%, highlighting the need for fast, automatic resource allocation.

Observability, especially logging token IDs and detailed performance metrics (median and tail latencies, QPS, prefill/decode split), is critical for debugging inference systems.

Performance optimization levers include speculative decoding (2x-8x speedups) and quantization (e.g., FP4), but require full-stack coordination and application-level validation.

The critical, yet overlooked, domain of inference

Inference, the process of using trained AI models to generate outputs, is presented not as a mere technicality but as the crucial revenue-generating engine of AI businesses. While training models garners significant attention for its research breakthroughs, it's inference that translates model weights into usable products and services. The speaker emphasizes that inference attracts substantial resources and is essential for capturing value, making it a "revenue center" rather than a "cost center." Even early-stage companies relying on venture capital need to demonstrate revenue streams, which are fundamentally enabled by efficient inference. Furthermore, inference is becoming increasingly integrated into the training loop itself, particularly through reinforcement learning, where generating model outputs to interact with the world and feed back into weights can consume more computational resources than pre-training. This growing demand across various AI applications underscores the importance of mastering inference engineering.

Defining the inference workload: From applications to metrics

Understanding the specific needs of an application is paramount for designing an effective inference system. Three archetypes are proposed: 'Chatbot Plus' (interactive, akin to ChatGPT, with human latency tolerances), 'Background Agents' (perform tasks autonomously, with multi-second to multi-hour latency constraints), and 'Data Processors' (extracting structure from unstructured data for downstream systems, tolerating higher latency but often dealing with bursty traffic). For the inference engineer, these translate into defining a 'workload' with specific Service Level Agreements (SLAs) or Service Level Objectives (SLOs). Key metrics for defining workloads include queries per second (QPS), which is user-driven and variable; the number of input and output tokens (the output count is hard to predict because it depends on the model's stopping criteria); prefix reuse (essential for caching computations); and latency budgets, specifically time to first token and time per output token. Measuring these metrics per replica first, and then scaling up to the full deployment, is vital for efficient resource allocation.
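The workload metrics listed above can be written down as a small spec and turned into a rough replica count. The following is a minimal sketch, not the speaker's method: the `Workload` class, its field names, the per-replica throughput figures, and the assumption that prefill and decode load simply add on a shared replica are all hypothetical and chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical workload spec; field names are illustrative."""
    qps: float             # queries per second at the load being planned for
    input_tokens: int      # expected prompt tokens per query
    output_tokens: int     # expected generated tokens per query
    prefix_reuse: float    # fraction of prompt tokens served from cache (0..1)
    ttft_budget_s: float   # time-to-first-token budget, seconds
    tpot_budget_s: float   # time-per-output-token budget, seconds

    def prefill_tokens_per_s(self) -> float:
        # Only uncached prompt tokens need fresh prefill compute.
        return self.qps * self.input_tokens * (1.0 - self.prefix_reuse)

    def decode_tokens_per_s(self) -> float:
        return self.qps * self.output_tokens

    def time_to_last_token_budget_s(self) -> float:
        # First token, then one TPOT interval per remaining token.
        return self.ttft_budget_s + self.tpot_budget_s * (self.output_tokens - 1)


def replicas_needed(w: Workload, replica_prefill_tps: float,
                    replica_decode_tps: float) -> int:
    """Assumes prefill and decode share each replica, so load fractions add."""
    load = (w.prefill_tokens_per_s() / replica_prefill_tps
            + w.decode_tokens_per_s() / replica_decode_tps)
    return max(1, math.ceil(load))


# Made-up numbers for an interactive workload and a measured replica.
chat = Workload(qps=10, input_tokens=2000, output_tokens=500,
                prefix_reuse=0.5, ttft_budget_s=0.5, tpot_budget_s=0.02)
print(replicas_needed(chat, replica_prefill_tps=20_000, replica_decode_tps=2_000))
```

The per-replica throughput numbers would come from benchmarking one replica under the latency budgets; the sketch only shows how the metrics combine once those measurements exist.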

Efficiency vs. Capability: Choosing the right model family

The landscape of transformer models for inference can be broadly categorized into two regimes: efficiency-bound and capability-bound. In the efficiency-bound regime, model intelligence is already sufficient for the task, and the primary concern becomes cost. This domain is largely dominated by open-source models, typically models of 1 to 50 billion parameters deployed on a single GPU. While multi-GPU setups can offer lower latency, they are less common here because these workloads often don't require human-interactive speeds. In contrast, capability-bound workloads demand the highest possible intelligence, and even the best current models may not yet suffice. These typically involve larger, multi-GPU, and even multi-node deployments. Proprietary models often lead here, though fine-tuned open-source models are rapidly catching up. This distinction is critical because it dictates the engineering choices, from model size and hardware requirements to the economic trade-offs.

Hardware essentials: The critical role of data center GPUs

The physical infrastructure for inference is dominated by specialized hardware. The distinction between the prefill phase (processing many input tokens at once) and the decode phase (generating output tokens one at a time) is crucial. Decode, in particular, is heavily memory-bandwidth bound. Hardware trends make this worse: compute throughput has been growing faster than memory bandwidth, so a workload needs ever higher arithmetic intensity to keep the GPU busy. Consequently, recent data center GPUs, specifically those with SXM form factors (like NVIDIA's H100 or B200), are essential. These offer high-bandwidth memory (HBM) soldered directly onto the substrate, providing the low latency and high bandwidth crucial for the decode phase, along with the better power delivery and cooling necessary for sustained high performance. Tensor Cores, specialized matrix multiplication units, are now the primary compute workhorses on these GPUs, so architectures that rely heavily on matrix multiplication, like standard transformers, are well suited to them.
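A back-of-the-envelope calculation shows why decode is memory-bandwidth bound. This is a simplified sketch under stated assumptions: it counts only weight reads (ignoring KV-cache and activation traffic), uses roughly 2 FLOPs per parameter per token, and plugs in approximate public figures for an H100 SXM (about 3.35 TB/s of HBM bandwidth and about 989 TFLOP/s of dense BF16 compute). The function names are hypothetical.

```python
def decode_step_floor_s(n_params: float, bytes_per_param: float,
                        mem_bw_bytes_per_s: float) -> float:
    """Lower bound on one decode step: every weight is read from HBM once."""
    return n_params * bytes_per_param / mem_bw_bytes_per_s


def decode_arithmetic_intensity(batch_size: int, bytes_per_param: float) -> float:
    """FLOPs per byte of weights read, assuming ~2 FLOPs per parameter per
    token and one weight read shared by the whole batch."""
    return 2.0 * batch_size / bytes_per_param


def ridge_point(peak_flops_per_s: float, mem_bw_bytes_per_s: float) -> float:
    """Arithmetic intensity above which the GPU becomes compute bound."""
    return peak_flops_per_s / mem_bw_bytes_per_s


# Approximate H100 SXM figures (assumptions, not measurements).
HBM_BW = 3.35e12       # bytes/s
PEAK_BF16 = 989e12     # FLOP/s, dense

# An 8-billion-parameter model stored in 16-bit precision.
floor_s = decode_step_floor_s(8e9, 2, HBM_BW)
print(f"decode floor: {floor_s * 1e3:.2f} ms per token")
print(f"intensity at batch 1: {decode_arithmetic_intensity(1, 2):.1f} FLOP/byte")
print(f"ridge point: {ridge_point(PEAK_BF16, HBM_BW):.0f} FLOP/byte")
```

At batch size 1 the intensity is about two orders of magnitude below the ridge point, so the Tensor Cores sit mostly idle while weights stream in. Larger batches, lower-precision weights, and speculative decoding all raise the useful work done per byte read.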

Deployment challenges: Scarcity, failure, and traffic variability

Inference is predominantly served from data centers due to the scarcity and cost of high-performance hardware like GPUs with HBM. Network latency can significantly impact ultra-low latency budgets, pushing deployments towards regional or edge-in-data-center configurations. The cost and demand for these resources necessitate maximizing hardware utilization. GPUs, particularly newer ones, have relatively short mean times to failure (weeks or days), requiring robust systems with redundancy. Unlike training, where a single GPU failure can halt the entire process, inference systems can route around failures by using independent replicas. However, a major challenge is traffic variability. Unpredictable swings in demand can lead to severe under-utilization if hardware is provisioned for peak loads. Solutions focus on fast, automatic scaling and efficient resource allocation to maximize utilization and maintain quality of service, avoiding paying for idle hardware.
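The cost of provisioning for peak load can be made concrete with a toy demand trace. The numbers below are invented for illustration, demand is expressed in "replicas' worth of traffic" per time slot, and the autoscaled case idealizes scaling as instantaneous with a one-replica buffer, which is the behavior that fast replica startup tries to approximate.

```python
def utilization(demand, capacity):
    """Fraction of provisioned replica-slots that actually serve traffic."""
    served = sum(min(d, c) for d, c in zip(demand, capacity))
    return served / sum(capacity)


# Hypothetical traffic over eight time slots, in replicas' worth of load.
demand = [2, 2, 3, 8, 10, 4, 2, 1]

# Strategy 1: provision for the peak and leave it running.
peak_provisioned = [max(demand)] * len(demand)

# Strategy 2: follow demand, keeping one idle replica as a buffer.
autoscaled = [d + 1 for d in demand]

print(f"peak-provisioned: {utilization(demand, peak_provisioned):.0%}")
print(f"autoscaled:       {utilization(demand, autoscaled):.0%}")
```

With this trace, static peak provisioning uses 40% of what is paid for, while tracking demand with a small buffer doubles that. Real gains depend on how quickly new replicas can actually start.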

Achieving high utilization: Fast scaling and efficient startup

To combat under-utilization, systems must enable fast and automatic GPU allocation. This involves operating a buffer of idle machines ready to handle traffic spikes, especially in multi-tenant environments. It also requires minimizing the time it takes to start new inference replicas. Key strategies include lazily loading file systems while eagerly fetching essential components like PyTorch and core OS libraries concurrently with replica startup, and storing them in a multi-tier cloud cache. Furthermore, just-in-time compilation and other application startup work, which engines such as TorchScript (JIT) or vLLM rely on, can take minutes. Technologies like CRIU and CUDA checkpointing, or methods like NVIDIA's GPU memory service, enable faster startup by checkpointing and restoring running processes and GPU state, essentially treating application state as data that can be saved and loaded more quickly than it can be recreated.

Observability and debugging: From logs to token IDs

Debugging inference systems requires robust observability, defined as the ability to diagnose issues solely from logs. Common bug categories include application-level issues (often shared with application developers), model quality bugs (e.g., train-serve skew), and performance bugs. Tokenizer bugs and inconsistencies in chat templates are highlighted as particularly problematic and common sources of subtle errors. Essential logging practices include recording token IDs alongside strings, comprehensive performance metrics (time to first/last token, QPS, prefill/decode volume, cache hit rates), and hardware metrics (temperature, power, utilization). Measurements should be taken at both replica and aggregate levels, considering both median and tail latencies. These detailed logs are crucial for identifying bottlenecks, regressions, and cross-replica differences, enabling swift diagnosis and resolution without needing complex reproduction steps.
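The logging practices above can be sketched as a per-request record from which the latency metrics are derived. This is an illustrative layout only: the `RequestLog` class and its field names are hypothetical, and the percentile helper uses the simple nearest-rank definition rather than any particular monitoring tool's method.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestLog:
    """Hypothetical per-request record; token IDs are kept, not just strings."""
    replica: str
    prompt_token_ids: list
    output_token_ids: list
    cached_prompt_tokens: int   # prompt tokens served from the prefix cache
    t_arrival: float            # seconds
    t_first_token: float
    t_last_token: float

    @property
    def ttft(self) -> float:
        return self.t_first_token - self.t_arrival

    @property
    def tpot(self) -> float:
        # Average gap between consecutive output tokens.
        n = len(self.output_token_ids)
        return (self.t_last_token - self.t_first_token) / (n - 1) if n > 1 else 0.0

    @property
    def cached_prefill_rate(self) -> float:
        return self.cached_prompt_tokens / len(self.prompt_token_ids)


def percentile(values, p: int):
    """Nearest-rank percentile for integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


log = RequestLog(replica="replica-0", prompt_token_ids=[101, 7, 42, 9],
                 output_token_ids=[5, 6, 7, 8, 9], cached_prompt_tokens=2,
                 t_arrival=0.0, t_first_token=0.25, t_last_token=2.25)
print(log.ttft, log.tpot, log.cached_prefill_rate)

# Median and tail can tell very different stories.
ttfts = [0.1, 0.2, 0.3, 0.4, 5.0]
print(percentile(ttfts, 50), percentile(ttfts, 99))
```

Grouping such records by `replica` before computing percentiles is one way to surface the cross-replica differences mentioned above, and the stored token IDs make it possible to check tokenizer or chat-template behavior without reproducing the request.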

Performance optimization: Speculation, quantization, and host-side work

Significant performance gains in inference can be achieved through several methods. Speculative decoding, which uses a smaller 'draft' model to predict several tokens ahead for the main model to verify, can yield 2x-8x speedups, especially when memory bandwidth is the bottleneck. Quantization, reducing precision (e.g., from FP8 to FP4), offers up to a 2x speedup by halving memory demand and increasing computational throughput. Both techniques, however, require full-stack coordination and application-level validation, as they can alter model behavior. Other optimizations focus on the host side, such as enabling CUDA graph capture to consolidate kernel launches and profiling Python-level operations with tools like py-spy so that host code does not block the GPU. GPU-level kernel optimization is typically reserved for last, as gains are often in the percentage points and require specialized tools like Nsight Systems.
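The draft-then-verify loop of speculative decoding can be shown with toy models. This sketch implements only the greedy variant, in which a drafted token is accepted when it matches the target model's own greedy choice; sampling-based variants use a rejection-sampling rule instead, and the source does not say which one the speaker had in mind. The two "models" are hypothetical stand-in functions, and the per-position target calls in the loop stand in for what a real engine would do in one batched forward pass.

```python
def greedy_decode(model, prompt, n_new):
    """Baseline: one target-model pass per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target, draft, prompt, n_new, k=4):
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(tokens) < goal:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks every proposed position (one pass in practice).
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target(tokens + proposal[:i])
            accepted.append(expected)       # the target's token is always kept
            if proposal[i] != expected:     # first mismatch: stop accepting
                break
        else:
            # All k drafts matched, so the same pass yields one bonus token.
            accepted.append(target(tokens + proposal))
        tokens.extend(accepted)
    return tokens[:goal], target_passes


# Toy stand-ins: the draft agrees with the target except at some positions.
def target_model(ctx):
    return (3 * ctx[-1] + len(ctx)) % 17


def draft_model(ctx):
    guess = target_model(ctx)
    return (guess + 1) % 17 if len(ctx) % 5 == 0 else guess


prompt = [1, 2]
baseline = greedy_decode(target_model, prompt, n_new=20)
spec_tokens, passes = speculative_decode(target_model, draft_model, prompt, n_new=20)
print(spec_tokens == baseline, passes)
```

In this greedy variant the output is identical to plain greedy decoding with the target model; the saving is in the number of target passes, which matters because each pass pays the full cost of streaming the weights from memory. The achievable speedup depends on how often the draft agrees with the target, which is why application-level validation of acceptance rates is needed.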

Common Questions

Training models is a cost center, while inference is how models are used and generate revenue, making it crucial for business sustainability and attracting investment. Efficient inference is key to turning model weights into a viable product.

Mentioned in this video

Software & Apps

TensorRT

The runtime underlying NVIDIA's TensorRT-LLM.

Claude

Another example of a chatbot application that fits the 'chatbot plus' archetype.

Gemma

A model family mentioned among options for efficiency-bound regimes.

ChatGPT

An example of a chatbot application that fits the 'chatbot plus' archetype.

Devin

A background agent example used for implementing features and opening PRs.

Mistral

A model family mentioned among options for efficiency-bound regimes.

vLLM

An open-source inference engine with wide adoption and enterprise flavor.

LangSmith

A specialized tool mentioned for debugging LLM applications and model quality.

PyTorch

Mentioned as a common interchange for CPU and GPU operations.

Linux

The operating system whose core libraries are eagerly loaded during container starts at Modal.

LangChain

A tool for building LLM applications, mentioned alongside specialized tools for model quality debugging.

Modal

The speaker's company, which offers serverless computing and focuses on engineering systems.

AWS

Mentioned in the context of trusting models with sensitive accounts like the root account for monitoring setup.

Groq

Developing LPUs that will be included in NVIDIA's next-generation hybrid racks.

Organizations

University of California, Berkeley

University where Charles obtained his PhD, focusing on neural network optimization.

Han Lab

Associated with the paper 'vibeserve' on writing bespoke inference engines.

Companies

NVIDIA

Mentioned as a company that designs chips and produces GPUs.

DeepSeek

A model family mentioned among options for efficiency-bound regimes.

Weights & Biases

A company where Charles worked, involved in full-stack deep learning and model development.

OpenAI

Mentioned in the context of company raises and AI infrastructure.

AMD

Mentioned as an alternative GPU provider, though programming is noted as challenging.

Intel

Mentioned in the context of potential issues with HBM feeding CPUs.

Google

Mentioned as the creator of TPUs, which inspired NVIDIA's Tensor Cores.

Braintrust

A specialized tool mentioned for debugging LLM applications and model quality.

Products

B200

An NVIDIA data center GPU mentioned alongside the H100.

H100

An NVIDIA GPU mentioned as an example of a data center SXM machine.
