Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 8: Parallelism
Key Moments
Training modern LLMs requires complex "4D" parallelism, combining data, tensor, pipeline, and expert strategies to overcome compute and memory bottlenecks across vast GPU clusters. This power, however, comes at the cost of intricate systems engineering and careful tuning to the underlying hardware.
Key Insights
To train massive language models, parallelism is essential due to compute and memory bottlenecks, requiring strategies to distribute models across multiple machines and GPUs.
Data parallelism splits the batch across GPUs but does not reduce memory usage, as each GPU must store a full copy of the model, gradients, and optimizer states.
Zero Redundancy Optimizer (ZeRO) stages offer memory savings by sharding optimizer states (Stage 1), gradients (Stage 2), and parameters (Stage 3), with Stage 3 (FSDP) achieving significant memory reduction by fetching parameters on demand.
Model parallelism strategies like pipeline (layer-wise) and tensor parallelism (matrix splitting) further break down the model to fit in memory, trading off increased communication for reduced memory footprints.
Expert parallelism, used for Mixture-of-Experts (MoE) models, shards experts across devices, offering an alternative to tensor parallelism with advantages in routing sparsity and avoiding small matrix multiplications.
Achieving high GPU utilization in large-scale training involves a combination of parallelism strategies (data, tensor, pipeline, expert, sequence), with specific rules of thumb for composing them based on network topology and model architecture, often requiring sophisticated system-level optimizations.
Addressing compute and memory bottlenecks with parallelism
The training of massive language models faces fundamental limitations in compute power and memory capacity, necessitating distributed training across numerous GPUs and machines. The primary bottlenecks are the sheer amount of computation required, which exceeds single-chip capabilities, and the enormous model sizes that cannot fit into the memory of a single GPU. This lecture delves into various parallelization strategies designed to distribute both computation and memory requirements, distinguishing between intra-node parallelism (fast, high-bandwidth connections) and inter-node parallelism (slower, more limited connections). These strategies rely on collective communication primitives like all-reduce, all-gather, and reduce-scatter, which form the algorithmic basis for distributed operations.
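The three collective primitives named above compose in a useful way: an all-reduce can be built from a reduce-scatter followed by an all-gather, which is exactly the decomposition that bandwidth-optimal ring implementations (and ZeRO, later in this summary) exploit. Below is a toy single-process sketch, using NumPy arrays to stand in for per-worker buffers; the function names and the simulation itself are illustrative, not any real library's API.

```python
import numpy as np

def reduce_scatter(shards):
    # Sum the full vectors across workers, then slice: worker i keeps chunk i.
    n = len(shards)
    summed = np.stack(shards).sum(axis=0)
    return np.split(summed, n)

def all_gather(chunks):
    # Every worker reconstructs the full vector from the N reduced slices.
    full = np.concatenate(chunks)
    return [full.copy() for _ in chunks]

def all_reduce(shards):
    # All-reduce decomposes into reduce-scatter + all-gather.
    return all_gather(reduce_scatter(shards))

# 4 simulated workers, each holding a "gradient" vector of length 8.
workers = [np.arange(8, dtype=float) * (i + 1) for i in range(4)]
result = all_reduce(workers)
assert all(np.allclose(r, sum(workers)) for r in result)
```

Every worker ends with the same summed vector, matching what a real NCCL or XLA all-reduce would produce over the network.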
Hardware philosophies and their impact on parallelism
Different hardware architectures present distinct networking philosophies that influence parallelization choices. Google's TPUs often employ a toroidal mesh topology, excelling at neighbor-to-neighbor communication and simplifying scaling. This makes them well-suited for predictable communication patterns in dense models. In contrast, NVIDIA's GPUs typically utilize a fat-tree or all-to-all philosophy, offering more flexibility for unpredictable communication, as seen in models like Mixture-of-Experts (MoE) where tokens might route to different experts. Recent hardware developments, such as Google's TPU v8 and its Virgo network, show a convergence toward more all-to-all connectivity, reflecting the evolving needs of large-scale model training and serving, where flexible bandwidth becomes critical.
Data parallelism and its memory limitations
Data parallelism is the most intuitive strategy, where a large batch is divided across multiple GPUs, with each GPU processing a subset of the data. While this scales compute efficiently, it offers no memory savings because each GPU must still store a complete replica of the model parameters, gradients, and importantly, the optimizer states. The memory required for optimizer states, especially in optimizers like Adam, can significantly exceed the memory needed for parameters and gradients themselves, often requiring multiples of the parameter size (e.g., 5x the parameter memory). This complete replication across GPUs makes naive data parallelism impractical for large models.
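The replication cost is easy to make concrete with a back-of-the-envelope accounting. The byte counts below are a common mixed-precision assumption (bf16 parameters and gradients, Adam state kept in fp32), not figures from the lecture itself:

```python
def training_memory_gb(n_params, param_bytes=2, grad_bytes=2, optim_bytes=12):
    # Assumed mixed-precision layout per parameter: 2 B bf16 weight,
    # 2 B bf16 gradient, and 12 B of Adam state in fp32
    # (4 B master copy + 4 B first moment + 4 B second moment).
    return n_params * (param_bytes + grad_bytes + optim_bytes) / 1e9

# Under naive data parallelism, every replica pays the full cost.
per_gpu = training_memory_gb(70e9)        # e.g. a 70B-parameter model
print(f"{per_gpu:.0f} GB per GPU")        # far beyond any single accelerator
```

Note that the optimizer state alone (12 B) is 6x the bf16 parameter memory under these assumptions, which is why sharding it first (ZeRO Stage 1) is such an easy win.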
ZeRO and memory optimization through sharding
The Zero Redundancy Optimizer (ZeRO) tackles the memory limitations of data parallelism by sharding different components of the training state across GPUs. ZeRO Stage 1 shards only the optimizer states, reducing memory by a factor of N (number of GPUs). Stage 2 further shards the gradients, while Stage 3 (Fully Sharded Data Parallel - FSDP) shards all three: optimizer states, gradients, and model parameters. In FSDP, each GPU only materializes the parameters for the layers it's currently processing, gathering them on demand and freeing them afterward. This significantly reduces per-GPU memory usage, enabling larger models to fit. While Stages 1 and 2 offer communication costs equivalent to naive data parallelism (one all-reduce), FSDP introduces additional communication primitives (reduce-scatter, all-gather) but can achieve near-zero overhead by overlapping communication with computation, making it highly efficient.
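The stage-by-stage savings can be sketched with the same per-parameter byte assumptions as above (bf16 weights and gradients, fp32 Adam state); the function is an illustrative model, not a profiler:

```python
def zero_per_gpu_gb(n_params, n_gpus, stage,
                    param_bytes=2, grad_bytes=2, optim_bytes=12):
    # Each ZeRO stage shards one more component across the n_gpus ranks:
    # stage 1 the optimizer state, stage 2 also gradients, stage 3 also params.
    p = n_params * param_bytes / (n_gpus if stage >= 3 else 1)
    g = n_params * grad_bytes / (n_gpus if stage >= 2 else 1)
    o = n_params * optim_bytes / (n_gpus if stage >= 1 else 1)
    return (p + g + o) / 1e9

# A 7B-parameter model sharded across 8 GPUs:
mem = [zero_per_gpu_gb(7e9, 8, s) for s in range(4)]
for s, m in enumerate(mem):
    print(f"ZeRO stage {s}: {m:6.2f} GB per GPU")
```

Stage 1 already removes most of the footprint because the fp32 optimizer state dominates; stage 3 (FSDP) divides everything by the GPU count, at the cost of the extra all-gathers described above.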
Model parallelism: Pipeline and Tensor parallelism
Model parallelism strategies break down the model itself across GPUs. Pipeline parallelism (or layer-wise parallelism) splits the model's layers, assigning consecutive layers to different GPUs. This reduces activation memory but introduces significant "pipeline bubbles"—idle time where GPUs wait for data from previous stages. These bubbles can be mitigated by micro-batching, effectively increasing the batch size to keep more GPUs busy. Tensor parallelism (or matrix parallelism) splits individual weight matrices within layers (e.g., in linear layers and attention mechanisms) across GPUs. This strategy incurs communication costs on every matrix multiplication but avoids pipeline bubbles and can effectively reduce activation memory by slicing it across the tensor parallel dimension. Tensor parallelism is best suited for high-bandwidth, low-latency interconnects like NVLink within a node, as inter-node communication can severely degrade performance.
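The effect of micro-batching on pipeline bubbles follows a simple closed form: with p stages and m micro-batches, a naive GPipe-style schedule idles for a (p - 1) / (m + p - 1) fraction of each step. A quick sketch (the formula is the standard one for this schedule, assumed rather than quoted from the lecture):

```python
def bubble_fraction(stages, microbatches):
    # Idle fraction of a naive GPipe-style schedule with p stages and
    # m micro-batches: the pipeline takes (m + p - 1) slots to fill and
    # drain, of which (p - 1) are pure bubble.
    return (stages - 1) / (microbatches + stages - 1)

# More micro-batches amortize the fill-and-drain bubble of an 8-stage pipeline.
for m in (1, 4, 16, 64):
    print(f"m={m:3d}: bubble = {bubble_fraction(8, m):.2f}")
```

With a single micro-batch, 7 of every 8 slots are wasted; pushing to 64 micro-batches drives the bubble below 10%, which is why pipeline parallelism effectively demands a large batch.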
Expert parallelism and sequence/context parallelism
Expert parallelism is another model parallelism strategy, particularly relevant for Mixture-of-Experts (MoE) models. It involves sharding the individual 'experts' (often the feed-forward networks within a transformer block) across different devices. This can be more efficient than tensor parallelism for MoE models, as it allows for sparse routing of tokens and avoids the overhead of small matrix multiplications seen in fine-grained tensor parallelism. Sequence or Context Parallelism, often used for very long sequences, splits activations along the sequence dimension. This is conceptually similar to FSDP in that it materializes necessary activations on demand by sharding them across accelerators, typically using all-gather and reduce-scatter operations. It's particularly useful for extending context length in models during training.
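The core data movement in expert parallelism is the token dispatch: each token is routed to an expert, and tokens bound for experts on another device must cross the network in an all-to-all exchange. The sketch below is a toy single-process stand-in for that grouping step; the `dispatch` function and even-sharding scheme are illustrative assumptions, not DeepSeek's or any library's implementation:

```python
import numpy as np

def dispatch(tokens, expert_ids, n_experts, n_devices):
    # Experts are sharded evenly, so expert e lives on device e // per_dev.
    # On real hardware, grouping tokens by destination device is an
    # all-to-all exchange over the network, and is latency sensitive.
    per_dev = n_experts // n_devices
    buckets = [[] for _ in range(n_devices)]
    for tok, e in zip(tokens, expert_ids):
        buckets[e // per_dev].append(tok)
    return buckets

rng = np.random.default_rng(0)
routes = rng.integers(0, 8, size=16)       # router picks 1 of 8 experts
buckets = dispatch(list(range(16)), routes, n_experts=8, n_devices=4)
assert sum(len(b) for b in buckets) == 16  # each token lands on one device
```

Because the router's choices are data-dependent, the bucket sizes are uneven, which is exactly the unpredictable communication pattern that favors all-to-all network topologies over fixed meshes.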
Composing parallelism strategies for optimal utilization
No single parallelism strategy is universally optimal. Large-scale training requires a sophisticated combination of these techniques, often referred to as 3D or 4D parallelism. The general prescription involves using data parallelism to maximize GPU utilization across the largest number of devices. Model parallelism (tensor, expert, pipeline) is then employed to reduce the memory footprint per GPU until the model fits. Tensor and expert parallelism are typically confined to fast intra-node interconnects, while pipeline parallelism is used for inter-node communication. Sequence parallelism further aids in memory reduction for long contexts. Optimizing this composite strategy involves intricate system-level engineering, balancing compute, communication, and memory constraints to achieve high GPU utilization, often drawing on detailed performance analysis and empirical guidelines from large-scale training runs and hardware specifications.
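The arithmetic behind such a composite layout is simple: the model-parallel degrees multiply, and the data-parallel degree is whatever factor of the cluster remains. A minimal sketch, under the simplifying assumption that the degrees factor cleanly and that expert parallelism occupies its own dimension:

```python
def data_parallel_degree(total_gpus, tp, pp, ep=1):
    # Tensor-, pipeline-, and expert-parallel degrees multiply together;
    # the leftover factor of the cluster is the data-parallel degree.
    model_gpus = tp * pp * ep
    assert total_gpus % model_gpus == 0, "degrees must divide the cluster"
    return total_gpus // model_gpus

# 1024 GPUs: tensor parallelism kept inside an 8-GPU NVLink node,
# pipeline parallelism spanning 16 stage groups over the slower network,
# leaving data parallelism to fill out the rest of the cluster.
dp = data_parallel_degree(1024, tp=8, pp=16)
print(dp)
```

This mirrors the rules of thumb above: tp bounded by the NVLink domain, pp stretched across nodes, and dp scaled as far as the global batch size allows.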
Practical insights from large-scale training configurations
Analysis of recent large-scale training runs reveals common patterns and emerging best practices. For instance, models like LLaMA 3 employ a multi-stage parallelization strategy, using tensor parallelism for dense layers and context parallelism for long-context extension. MoE models like DeepSeek v3 and Mixtral leverage expert parallelism, often combined with pipeline and tensor parallelism, demonstrating the effectiveness of sharding experts. Studies and NVIDIA's parallelism guidelines highlight trade-offs: staying within fast intra-node interconnects for tensor/expert parallelism, using pipeline parallelism for inter-node links, and maximizing data parallelism. The effectiveness of these strategies is underscored by their ability to maintain high GPU utilization even with massive clusters, though it necessitates significant system complexity and careful tuning.
Comparison of Parallelism Strategies
| Strategy | Primary Goal | Memory Savings | Communication Cost | Key Advantage | Key Disadvantage |
|---|---|---|---|---|---|
| Data Parallelism | Scale Compute | None (Replicates model) | All-reduce gradients (2x parameters) | Simple, scales compute | No memory scaling, limited by batch size |
| ZeRO Stage 1 (Optimizer State Sharding) | Memory Scaling | Significant (Optimizer state) | Same as DDP (equiv. to All-reduce) | Free memory savings | Doesn't scale parameter/gradient memory |
| ZeRO Stage 2 (Gradient & Optimizer State Sharding) | Memory Scaling | Significant (Gradients + Optimizer state) | Same as DDP (equiv. to All-reduce) | More memory savings | Still doesn't scale parameter memory |
| ZeRO Stage 3 / FSDP (Full Sharding) | Memory Scaling | High (Parameters, Gradients, Optimizer state) | Slightly more than DDP (2 All-gather + Reduce-scatter) | Massive memory reduction, potentially overhead-free with overlap | Complex, requires communication overlap |
| Pipeline Parallelism | Memory Scaling | High (Splits layers) | Point-to-point (Activation size: B*S*H) | Good for slowest links, memory efficient | Pipeline bubbles (idle time), sensitive to batch size |
| Tensor Parallelism | Memory Scaling | High (Splits matrices) | Activation size (All-reduce, activation-level) | No pipeline bubble, low complexity | High communication overhead, best within node |
| Expert Parallelism | Memory Scaling | High (Splits experts in MoE) | All-to-all dispatches (Latency sensitive) | Efficient for MoE, reduces activation memory | Complex implementation, latency sensitive |
| Sequence/Context Parallelism | Memory Scaling | High (Splits activations by sequence) | All-gathers/Reduce-scatters | Reduces activation memory linearly | FSDP-like gather/scatter overhead that needs careful management |
Common Questions
Why must large language models be trained across many GPUs and machines?
The primary bottlenecks are compute power and memory capacity. Large language models require more computational resources than a single chip can provide, and their parameters often exceed the memory of a single GPU, forcing the workload to be distributed across multiple machines and GPUs.
Mentioned in this video
Developer of TPUs, including the announced TPU8; also associated with the Gemini 2 model.
Manufacturer of GPUs and Megatron parallelism library. Mentioned for their H200 and other systems.
A company known for its systems engineering strength in AI; developed the DPP library for expert parallelism and trained models like DeepSeek v1 and v3.
Google's accelerator, characterized by its toroidal mesh networking, suitable for dense models and predictable partitions.
A TPU model with a tree topology and an all-to-all connection philosophy, suitable for modern language models and training.
A Chinese AI chip with a large number of interconnected chips, offering high scalability at the cost of high power consumption.
A GPU model mentioned as a point of comparison for the Huawei Ascend 910's performance.
Graphics Processing Unit, described as lightweight compared to TPUs, with a fat tree networking philosophy and suitability for mixture of experts models.
A higher-level networking stack mentioned in the context of cross-rack connectivity for TPU8.
A high-speed interconnect, recommended for keeping expert and tensor parallelism within a single machine.
An optimization algorithm that requires tracking first and second moments of gradients, leading to significant memory overhead.
Naive Data Parallelism, used as a baseline for comparison with ZeRO stages, characterized by replicating model copies across GPUs.
Fully Sharded Data Parallelism, a memory-saving technique that shards parameters, gradients, and optimizer states across GPUs, often used in PyTorch.
A deep learning framework mentioned in relation to FSDP implementation and profiling tools.
DeepSeek's library for expert parallelism, focusing on low-level GPU networking primitives for efficient routing and dispatching.
NVIDIA's low-level GPU instruction representation; DeepSeek used undocumented PTX instructions to accelerate networking communication.
NVIDIA's parallelism library and associated guidance document, used as a reference for best practices in parallelism strategies.
A 7B-parameter model trained on the Dolma dataset using FSDP.
A model trained in the Chinese open-weights setting, using Zero Stage 1, tensor, and pipeline parallelism.
A large dense model for which NVIDIA reported detailed parallelism strategies across different training phases, including long context extension.
A Google open-source model that uses FSDP, tensor parallel, and sequence parallel, realized on Google's TPU mesh.
A large mixture-of-experts model where parallelism configurations were investigated, using expert, pipeline, and tensor parallelism.
A repository from NVIDIA that provides recommended training configurations for various model sizes and settings.
A model that follows the DeepSeek recipe for parallelism, utilizing a large amount of expert parallel, pipeline parallel, and tensor parallel.