Key Moments
Stanford CS25: Transformers United V6 | The Ultra-Scale Talk: Scaling Training to Thousands of GPUs
Training LLMs on thousands of GPUs requires a complex blend of data, tensor, pipeline, sequence, and expert parallelism, each with trade-offs between memory, communication, and implementation complexity.
Key Insights
Large Language Models (LLMs) are scaling to trillions of parameters and trained on massive datasets (e.g., 15 trillion tokens) with context lengths up to 1 million tokens.
Data Parallelism (DP) variants like ZeRO-1, ZeRO-2, and ZeRO-3 (FSDP) progressively reduce memory by sharding optimizer states, gradients, and model parameters, with ZeRO-3 offering the most memory savings at the cost of increased communication.
Tensor Parallelism (TP) shards model weights and computation across GPUs, enabling larger models but requiring significant communication (ALL_REDUCE) at each block and careful synchronization of parameters outside the TP region.
Sequence Parallelism (SP) is a flavor of TP that shards activations along the sequence dimension, reducing memory for activations and eliminating the need to synchronize LayerNorm gradients but still incurring high communication costs.
Pipeline Parallelism (PP) vertically shards layers across GPUs, reducing memory per GPU but introducing a 'pipeline bubble' (idle time) that requires complex schedulers (e.g., 1F1B, DualPipe) or many micro-batches to mitigate.
Expert Parallelism (EP) shards experts in Mixture-of-Experts (MoE) models, requiring ALL_TO_ALL communication for routing tokens and often combined with DP and PP to manage communication overhead and data parallelism requirements.
The imperative for scaling LLMs
The current trend in Large Language Models (LLMs) is relentless scaling, driven by observed correlations between model size and intelligence. Models now routinely reach trillions of parameters, trained on datasets of 15 trillion tokens with context lengths extending to 1 million. This necessitates massive computational infrastructure, placing immense pressure on GPUs for data loading, computation, and checkpointing within tight time constraints (e.g., 1 second per training iteration for a trillion-parameter model).
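To make the time pressure concrete, here is a back-of-the-envelope compute budget using the common 6·N·D FLOPs rule of thumb (roughly 6 FLOPs per parameter per training token). The GPU count, peak throughput, and utilization figures are illustrative assumptions, not numbers from the talk.

```python
# Rule-of-thumb training-time estimate: total FLOPs ~ 6 * params * tokens.
# peak_flops and mfu (model FLOPs utilization) are assumed, illustrative values.
def training_days(params, tokens, gpus, peak_flops=1e15, mfu=0.4):
    total_flops = 6 * params * tokens          # ~6 FLOPs per parameter per token
    cluster_flops = gpus * peak_flops * mfu    # sustained cluster throughput
    return total_flops / cluster_flops / 86_400

# A trillion-parameter model on 15T tokens with 10,000 hypothetical GPUs:
days = training_days(params=1e12, tokens=15e12, gpus=10_000)
```

Even under these optimistic assumptions the run takes on the order of months, which is why every second of GPU idle time per iteration matters.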
Mastering memory with ZeRO and data parallelism
Dealing with models that exceed single GPU memory requires distributed strategies. Data Parallelism (DP) is a foundational technique where data is sharded across GPUs, each holding a replica of the model and gradients. To synchronize gradients, the ALL_REDUCE collective operation is used. Efficient overlap of communication with computation, achieved by bucketing gradients, is crucial to minimize GPU idle time. However, standard DP duplicates optimizer states and can't address models larger than GPU memory. The ZeRO (Zero Redundancy Optimizer) family, including ZeRO-1, ZeRO-2, and ZeRO-3 (also known as Fully Sharded Data Parallelism or FSDP), progressively optimizes memory. ZeRO-1 shards optimizer states, ZeRO-2 further shards gradients, and ZeRO-3 shards model parameters as well. ZeRO-3 significantly reduces memory by ensuring each GPU only holds a shard of parameters, optimizer states, and gradients, allowing for training of massive models. The trick involves replacing ALL_REDUCE with REDUCE_SCATTER followed by ALL_GATHER, enabling distributed optimization steps.
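The REDUCE_SCATTER + ALL_GATHER pattern at the heart of ZeRO-3 can be sketched in a toy single-process simulation. This is not real `torch.distributed` code; NumPy arrays stand in for per-rank buffers, and the world size, parameter count, and plain-SGD update are illustrative choices.

```python
import numpy as np

# Toy simulation of the ZeRO-3 communication pattern: each "rank" owns one
# shard of the parameters; gradients are reduce-scattered so every rank
# updates only its own shard, and updated shards are all-gathered before
# the next forward pass.
world = 4
full_params = np.zeros(8)                      # 8 parameters, 2 per rank
shards = np.split(full_params.copy(), world)   # each rank's owned shard

# Each rank computes a full gradient on its local batch (data parallelism).
local_grads = [np.full(8, float(r + 1)) for r in range(world)]

# REDUCE_SCATTER: gradients are summed, but each rank keeps only its shard.
summed = np.sum(local_grads, axis=0)
grad_shards = np.split(summed, world)

# Local optimizer step on the owned shard only (plain SGD here).
lr = 0.1
shards = [p - lr * g for p, g in zip(shards, grad_shards)]

# ALL_GATHER: reassemble full parameters on every rank for the next forward.
full_params = np.concatenate(shards)
```

Note that REDUCE_SCATTER followed by ALL_GATHER moves the same total bytes as a single ALL_REDUCE, which is why ZeRO-3's memory savings come mostly from *when* parameters are materialized, not from cheaper collectives.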
Sharding models with tensor and sequence parallelism
When models are too large even for ZeRO strategies, Tensor Parallelism (TP) offers a solution by sharding the model's weights and computations across multiple GPUs. For instance, in matrix multiplications, weights can be split column-wise and row-wise, requiring ALL_REDUCE operations in both forward and backward passes to maintain correctness. This reduces the memory footprint and computation per GPU but introduces significant communication overhead at each block. TP assumes that activations remain identical across GPUs outside the sharded region; parameters that sit outside the TP region (such as LayerNorms) are replicated, so their gradients must be explicitly synchronized to prevent the replicas from drifting apart numerically. Sequence Parallelism (SP) further optimizes TP by sharding activations along the sequence dimension, which is particularly beneficial for long-context training. SP replaces TP's ALL_REDUCE with the REDUCE_SCATTER and ALL_GATHER primitives, similar to ZeRO, sharding activations where they are not needed in full, thereby reducing memory usage and eliminating the need for explicit LayerNorm gradient synchronization. However, like TP, SP remains communication-heavy.
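The column-wise/row-wise split can be verified numerically in a toy single-process sketch: splitting the first weight by columns and the second by rows yields per-rank partial outputs whose sum (the ALL_REDUCE) equals the unsharded result. Shapes and the 2-way split are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))      # activations: (batch, hidden)
w1 = rng.standard_normal((8, 16))    # first weight, split column-wise
w2 = rng.standard_normal((16, 8))    # second weight, split row-wise

# Reference: the unsharded two-matmul block.
ref = (x @ w1) @ w2

# 2-way tensor parallelism: rank i holds (w1 columns i, w2 rows i).
w1_shards = np.split(w1, 2, axis=1)
w2_shards = np.split(w2, 2, axis=0)

# Each rank computes a partial result without any communication in between;
# the final sum is the ALL_REDUCE across the TP group.
partials = [(x @ a) @ b for a, b in zip(w1_shards, w2_shards)]
out = partials[0] + partials[1]      # ALL_REDUCE
```

Pairing a column-parallel layer with a row-parallel layer is what lets a whole MLP block run with a single ALL_REDUCE per pass instead of one per matmul.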
Stitching layers with pipeline parallelism
Pipeline Parallelism (PP) vertically shards the model by distributing layers across different GPUs. GPU 1 handles initial layers, GPU 2 the next set, and so on. This significantly reduces the memory required per GPU as each only holds a fraction of the model layers. The primary challenge with PP is the 'pipeline bubble'—periods of GPU idle time as data moves between stages. To mitigate this, complex scheduling strategies like '1 forward, 1 backward' (1F1B) or more advanced methods like DualPipe are employed. These schedules interleave forward and backward passes for multiple micro-batches to keep GPUs busy. PP requires saving activations for multiple micro-batches, often necessitating activation checkpointing or CPU offloading to manage memory. It scales batch size with the number of micro-batches and is relatively communication-cheap, involving only activation and gradient exchanges between adjacent stages.
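The pipeline bubble can be quantified with the classic estimate for simple schedules: with p stages and m micro-batches, the idle fraction is roughly (p−1)/(m+p−1). The stage and micro-batch counts below are illustrative.

```python
# Idle-time ("bubble") fraction for a simple pipeline schedule:
# with p stages and m micro-batches, idle fraction ~ (p - 1) / (m + p - 1).
def bubble_fraction(stages, micro_batches):
    return (stages - 1) / (micro_batches + stages - 1)

few = bubble_fraction(stages=8, micro_batches=8)    # ~47% of time idle
many = bubble_fraction(stages=8, micro_batches=64)  # ~10% of time idle
```

This is exactly why PP wants many micro-batches in flight, and why schedules like 1F1B and DualPipe exist: more micro-batches shrink the bubble but inflate the activation memory that must be kept alive, motivating activation checkpointing or CPU offloading.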
Handling experts in Mixture-of-Experts (MoE) with expert parallelism
For Mixture-of-Experts (MoE) models, Expert Parallelism (EP) is essential. It shards the experts across GPUs. This requires not only sharding the experts but also the data across GPUs (making it less orthogonal to DP) to ensure all experts can be utilized. The core communication primitive in EP is ALL_TO_ALL, used for dispatching tokens to their assigned experts and combining results. A significant bottleneck in EP is the CPU-GPU synchronization required for the router to compute expert assignments and buffer sizes, especially without specialized hardware like InfiniBand with IBGDA. This CPU-GPU sync can drastically slow down training. To address the communication overhead and CPU-GPU sync, EP is often combined with PP, interleaving the ALL_TO_ALL operations with forward/backward passes of other batches.
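The dispatch/combine pattern can be sketched in a toy single-process simulation, with one expert per rank and an identity "expert" for simplicity; list indexing stands in for the ALL_TO_ALL exchange. The router here is a trivial modulo assignment, purely for illustration.

```python
import numpy as np

# Toy simulation of EP dispatch: a router assigns each token to an expert,
# and an ALL_TO_ALL-style exchange groups tokens by the rank hosting that
# expert. One expert per rank here for simplicity.
world = 4
tokens = np.arange(12)                               # 12 token ids
assignments = np.array([t % world for t in tokens])  # toy router

# "ALL_TO_ALL dispatch": each rank receives the tokens routed to its expert.
recv = [tokens[assignments == r] for r in range(world)]

# Experts process their tokens (identity here), then the reverse
# ALL_TO_ALL ("combine") restores the original token order.
out = np.empty_like(tokens)
for r in range(world):
    out[assignments == r] = recv[r]
```

In a real system the receive-buffer sizes depend on the router's output, which is the source of the CPU-GPU synchronization bottleneck the talk highlights: without hardware support like IBGDA, the host must learn the counts before the exchange can be posted.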
Combining parallelism strategies for ultra-scale training
Achieving ultra-scale training with thousands of GPUs necessitates combining these parallelism strategies. Data, Tensor, Pipeline, Sequence, and Expert Parallelism are designed to be orthogonal, allowing them to be composed into a 'parallelism mesh'. Frameworks like PyTorch (with FSDP, Tensor Parallelism utilities), Megatron-LM, and Nanotron abstract these complexities, enabling users to configure different parallelism dimensions. The choice of which strategies to employ, and to what degree, is highly dependent on hardware constraints (GPU memory, network interconnects like InfiniBand) and model characteristics (size, architecture, sequence length). For instance, TP is often confined to single nodes due to its communication intensity, while PP can span multiple nodes. The ultimate goal is to maximize hardware utilization, minimize idle time, and efficiently manage memory and communication across the distributed system.
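Because the strategies are orthogonal, a cluster configuration is essentially a factorization of the world size into a parallelism mesh. The sketch below shows the bookkeeping only; the specific cluster size and split are illustrative, in the spirit of configurations used by frameworks like Megatron-LM or Nanotron.

```python
# A parallelism configuration factorizes the cluster:
#   world_size = dp * tp * pp   (times ep when MoE experts are sharded).
world_size = 512
tp = 8                          # tensor parallel: kept within one 8-GPU node
pp = 8                          # pipeline parallel: spans nodes (cheap comms)
dp = world_size // (tp * pp)    # remaining dimension is data parallel

assert tp * pp * dp == world_size
```

The placement reflects communication cost: the chattiest dimension (TP) stays on the fast intra-node links, while communication-light PP crosses the slower inter-node network.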
Common Questions
Why is scaling LLMs so important?
Scaling is crucial because larger models demonstrate improved intelligence and performance. The trend in LLMs is towards larger parameter counts and training on vast amounts of data, which necessitates advanced scaling techniques.
Mentioned in this video
A scaling technique discussed in relation to large models, with Nouamane Tazi's work spanning initiatives in this area.
A parallelism strategy specifically designed to efficiently partition large sequences, particularly in attention mechanisms.
Zero-Redundancy Optimizer, a memory optimization technique for distributed training that shards optimizer states, gradients, and parameters.
A memory optimization technique where activations are moved from GPU to CPU memory to free up GPU VRAM.
Large Language Models, whose trend is to train larger and larger models, correlating with increased intelligence.
A function or mechanism within PyTorch's distributed tensor parallelism library that facilitates combining different parallelism strategies.
InfiniBand GPUDirect Async (IBGDA), a hardware feature that lets GPUs initiate network transfers directly, avoiding costly CPU-GPU synchronization in tasks like token dispatching in Expert Parallelism.
Remote Direct Memory Access, a networking technology that allows direct memory access between computing nodes, improving communication efficiency.
A project worked on by Nouamane Tazi at HuggingFace.
An open-source machine learning framework used for implementing distributed training strategies like DDP and FSDP.
An optimizer that requires full tensors, posing a challenge for parameter sharding strategies that flatten tensors.
An optimized attention mechanism whose online softmax technique inspired the approach used in Ring Attention.
An open-source distributed training library developed by HuggingFace.
An example of a recent large language model with one trillion parameters.
A library for large-scale deep learning model training, mentioned in the context of combining parallelism and its implementation.
Graphics Processing Unit, a type of accelerator used for training large models, limited by VRAM.
Fully Sharded Data Parallelism, a PyTorch implementation (also referred to as ZeRO Stage 3) that shards model parameters, gradients, and optimizer states across GPUs.
Distributed Data Parallelism, a class in PyTorch used for data parallelism that automatically handles gradient synchronization.