
Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 8: Parallelism

Stanford Online
Apr 28, 2026 | 81 min video
TL;DR

Training modern LLMs requires complex "4D" parallelism, combining data, tensor, pipeline, and expert strategies to overcome compute and memory bottlenecks across vast GPU clusters. However, this complexity comes at the cost of intricate system engineering and sophisticated hardware.

Key Insights

1. To train massive language models, parallelism is essential due to compute and memory bottlenecks, requiring strategies to distribute models across multiple machines and GPUs.

2. Data parallelism splits the batch across GPUs but does not reduce memory usage, as each GPU must store a full copy of the model, gradients, and optimizer states.

3. Zero Redundancy Optimizer (ZeRO) stages offer memory savings by sharding optimizer states (Stage 1), gradients (Stage 2), and parameters (Stage 3), with Stage 3 (FSDP) achieving significant memory reduction by fetching parameters on demand.

4. Model parallelism strategies like pipeline parallelism (layer-wise) and tensor parallelism (matrix splitting) further break down the model to fit in memory, trading increased communication for a reduced memory footprint.

5. Expert parallelism, used for Mixture-of-Experts (MoE) models, shards experts across devices, offering an alternative to tensor parallelism with advantages in routing sparsity and avoiding small matrix multiplications.

6. Achieving high GPU utilization in large-scale training involves combining parallelism strategies (data, tensor, pipeline, expert, sequence), with specific rules of thumb for composing them based on network topology and model architecture, often requiring sophisticated system-level optimizations.

Addressing compute and memory bottlenecks with parallelism

The training of massive language models faces fundamental limitations in compute power and memory capacity, necessitating distributed training across numerous GPUs and machines. The primary bottlenecks are the sheer amount of computation required, which exceeds single-chip capabilities, and the enormous model sizes that cannot fit into the memory of a single GPU. This lecture delves into various parallelization strategies designed to distribute both computation and memory requirements, distinguishing between intra-node parallelism (fast, high-bandwidth connections) and inter-node parallelism (slower, more limited connections). These strategies rely on collective communication primitives like all-reduce, all-gather, and reduce-scatter, which form the algorithmic basis for distributed operations.
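
As a concrete illustration of these primitives, the sketch below exercises all-reduce, all-gather, and reduce-scatter with PyTorch's `torch.distributed`. The launch command, backend choice, and tensor shapes are assumptions for the example, not details from the lecture.

```python
# Minimal sketch of the collective primitives distributed training builds on.
# Assumes one process per GPU, launched e.g. with:
#   torchrun --nproc_per_node=4 collectives.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank, world = dist.get_rank(), dist.get_world_size()
    device = torch.device("cuda", int(os.environ["LOCAL_RANK"]))
    torch.cuda.set_device(device)

    # all-reduce: every rank ends up with the sum over all ranks
    # (this is how data parallelism averages gradients).
    x = torch.full((4,), float(rank), device=device)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    # all-gather: every rank collects one shard from every rank
    # (FSDP uses this to rebuild full parameters before a layer runs).
    shard = torch.full((4,), float(rank), device=device)
    gathered = [torch.empty(4, device=device) for _ in range(world)]
    dist.all_gather(gathered, shard)

    # reduce-scatter: sum across ranks, then keep only this rank's shard
    # (FSDP/ZeRO uses this to leave each rank with its gradient shard).
    inputs = [torch.full((4,), float(rank), device=device) for _ in range(world)]
    out = torch.empty(4, device=device)
    dist.reduce_scatter(out, inputs, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```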

Hardware philosophies and their impact on parallelism

Different hardware architectures present distinct networking philosophies that influence parallelization choices. Google's TPUs often employ a toroidal mesh topology, excelling at neighbor-to-neighbor communication and simplifying scaling. This makes them well-suited for predictable communication patterns in dense models. In contrast, NVIDIA's GPUs typically utilize a fat-tree or all-to-all philosophy, offering more flexibility for unpredictable communication, as seen in models like Mixture-of-Experts (MoE) where tokens might route to different experts. Recent hardware developments, such as Google's TPU v8 and its Virgo network, show a convergence toward more all-to-all connectivity, reflecting the evolving needs of large-scale model training and serving, where flexible bandwidth becomes critical.

Data parallelism and its memory limitations

Data parallelism is the most intuitive strategy, where a large batch is divided across multiple GPUs, with each GPU processing a subset of the data. While this scales compute efficiently, it offers no memory savings because each GPU must still store a complete replica of the model parameters, gradients, and importantly, the optimizer states. The memory required for optimizer states, especially in optimizers like Adam, can significantly exceed the memory needed for parameters and gradients themselves, often requiring multiples of the parameter size (e.g., 5x the parameter memory). This complete replication across GPUs makes naive data parallelism impractical for large models.
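
A rough calculation makes the point. Assuming a common mixed-precision Adam setup (bf16 parameters and gradients plus fp32 master weights and both moments; these byte counts are an assumption of this example, not figures quoted in the lecture), the per-GPU training state under naive data parallelism is:

```python
# Back-of-the-envelope per-GPU memory for naive data parallelism with mixed-precision Adam.
def ddp_memory_gb(num_params: float) -> float:
    bytes_per_param = (
        2      # bf16 parameters
        + 2    # bf16 gradients
        + 4    # fp32 master copy of parameters
        + 4    # fp32 Adam first moment (m)
        + 4    # fp32 Adam second moment (v)
    )
    return num_params * bytes_per_param / 1e9

# A 70B-parameter model needs on the order of 1.1 TB of training state on *every* GPU
# under naive data parallelism -- far beyond an 80 GB accelerator, before activations.
print(f"{ddp_memory_gb(70e9):.0f} GB")   # ~1120 GB
```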

ZeRO and memory optimization through sharding

The Zero Redundancy Optimizer (ZeRO) tackles the memory limitations of data parallelism by sharding different components of the training state across GPUs. ZeRO Stage 1 shards only the optimizer states, reducing memory by a factor of N (number of GPUs). Stage 2 further shards the gradients, while Stage 3 (Fully Sharded Data Parallel - FSDP) shards all three: optimizer states, gradients, and model parameters. In FSDP, each GPU only materializes the parameters for the layers it's currently processing, gathering them on demand and freeing them afterward. This significantly reduces per-GPU memory usage, enabling larger models to fit. While Stages 1 and 2 offer communication costs equivalent to naive data parallelism (one all-reduce), FSDP introduces additional communication primitives (reduce-scatter, all-gather) but can achieve near-zero overhead by overlapping communication with computation, making it highly efficient.
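
A minimal sketch of ZeRO Stage 3-style sharding with PyTorch's FSDP is shown below. The toy model, launch command, and root-only wrapping are illustrative assumptions; in practice each transformer block is usually wrapped separately via an auto-wrap policy so that only one block's parameters are materialized at a time.

```python
# Minimal FSDP sketch. Launch with: torchrun --nproc_per_node=4 fsdp_sketch.py
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Toy stand-in for a transformer stack.
model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(8)]).cuda()

# FSDP shards parameters, gradients, and optimizer state across ranks. With this
# root-only wrap, all weights are all-gathered for the forward pass; wrapping each
# block separately (auto_wrap_policy) is what limits materialization to one block.
model = FSDP(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

x = torch.randn(16, 1024, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()       # gradients are reduce-scattered into per-rank shards
optimizer.step()      # each rank updates only its own shard of the Adam state
optimizer.zero_grad()
dist.destroy_process_group()
```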

Model parallelism: Pipeline and Tensor parallelism

Model parallelism strategies break down the model itself across GPUs. Pipeline parallelism (or layer-wise parallelism) splits the model's layers, assigning consecutive layers to different GPUs. This reduces activation memory but introduces significant "pipeline bubbles"—idle time where GPUs wait for data from previous stages. These bubbles can be mitigated by micro-batching, effectively increasing the batch size to keep more GPUs busy. Tensor parallelism (or matrix parallelism) splits individual weight matrices within layers (e.g., in linear layers and attention mechanisms) across GPUs. This strategy incurs communication costs on every matrix multiplication but avoids pipeline bubbles and can effectively reduce activation memory by slicing it across the tensor parallel dimension. Tensor parallelism is best suited for high-bandwidth, low-latency interconnects like NVLink within a node, as inter-node communication can severely degrade performance.
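
The sketch below illustrates the tensor-parallel idea on a two-layer MLP in the Megatron style (column-split first matrix, row-split second, one all-reduce per MLP). It assumes an already-initialized process group of size `tp`; the class name and dimensions are made up for the example, and the matching backward-pass communication is omitted.

```python
# Megatron-style tensor-parallel MLP sketch: each rank holds 1/tp of each weight matrix.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist

class TensorParallelMLP(nn.Module):
    def __init__(self, d_model: int, d_ff: int, tp: int):
        super().__init__()
        # Column-parallel: each rank holds d_ff / tp output columns of W1.
        self.w1 = nn.Linear(d_model, d_ff // tp, bias=False)
        # Row-parallel: each rank holds d_ff / tp input rows of W2.
        self.w2 = nn.Linear(d_ff // tp, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.gelu(self.w1(x))                     # local shard of the hidden activations
        y = self.w2(h)                             # partial sums of the full output
        dist.all_reduce(y, op=dist.ReduceOp.SUM)   # one all-reduce combines partial outputs
        return y                                   # (backward-pass collectives omitted)
```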

Expert parallelism and sequence/context parallelism

Expert parallelism is another model parallelism strategy, particularly relevant for Mixture-of-Experts (MoE) models. It involves sharding the individual 'experts' (often the feed-forward networks within a transformer block) across different devices. This can be more efficient than tensor parallelism for MoE models, as it allows for sparse routing of tokens and avoids the overhead of small matrix multiplications seen in fine-grained tensor parallelism. Sequence or Context Parallelism, often used for very long sequences, splits activations along the sequence dimension. This is conceptually similar to FSDP in that it materializes necessary activations on demand by sharding them across accelerators, typically using all-gather and reduce-scatter operations. It's particularly useful for extending context length in models during training.
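
A deliberately simplified sketch of expert-parallel dispatch follows: one expert per rank, top-1 routing, a fixed per-expert capacity, and two all-to-all exchanges (dispatch and return). Real systems add load balancing, capacity tuning, and fused communication kernels; all names here are illustrative.

```python
# Toy expert-parallel MoE layer: one expert per rank, top-1 routing, fixed capacity.
import torch
import torch.nn as nn
import torch.distributed as dist

def moe_forward(x: torch.Tensor, router: nn.Linear, local_expert: nn.Module, capacity: int):
    """x: [tokens, d_model]; router maps d_model -> world_size (one logit per expert)."""
    world = dist.get_world_size()
    d = x.size(-1)

    # 1. Route: pick a destination expert (= rank) for every token.
    dest = router(x).argmax(dim=-1)

    # 2. Pack up to `capacity` tokens per destination into a fixed-size send buffer
    #    (overflow tokens are dropped, as in capacity-factor MoEs).
    send = x.new_zeros(world, capacity, d)
    for r in range(world):
        toks = x[dest == r][:capacity]
        send[r, : toks.size(0)] = toks

    # 3. all-to-all dispatch: rank r receives the tokens routed to its expert.
    recv = torch.empty_like(send)
    dist.all_to_all_single(recv, send)

    # 4. Apply the local expert to everything this rank received.
    out = local_expert(recv.view(-1, d)).view(world, capacity, d)

    # 5. all-to-all return: expert outputs go back to the ranks that own the tokens.
    combined = torch.empty_like(out)
    dist.all_to_all_single(combined, out)
    return combined  # unpacking back to the original token order is omitted for brevity
```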

Composing parallelism strategies for optimal utilization

No single parallelism strategy is universally optimal. Large-scale training requires a sophisticated combination of these techniques, often referred to as 3D or 4D parallelism. The general prescription involves using data parallelism to maximize GPU utilization across the largest number of devices. Model parallelism (tensor, expert, pipeline) is then employed to reduce the memory footprint per GPU until the model fits. Tensor and expert parallelism are typically confined to fast intra-node interconnects, while pipeline parallelism is used for inter-node communication. Sequence parallelism further aids in memory reduction for long contexts. Optimizing this composite strategy involves intricate system-level engineering, balancing compute, communication, and memory constraints to achieve high GPU utilization, often drawing on detailed performance analysis and empirical guidelines from large-scale training runs and hardware specifications.
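
A toy calculator along these lines shows how the parallelism degrees multiply to fill a cluster and how sharding shrinks per-GPU parameter memory; the layout and numbers below are illustrative assumptions, not a configuration from the lecture.

```python
# Toy 4D-parallelism layout calculator: world size = DP x TP x PP x EP.
def layout(num_gpus, num_params, tp, pp, ep=1, bytes_per_param=2, fsdp=False):
    assert num_gpus % (tp * pp * ep) == 0, "parallel dims must divide the cluster size"
    dp = num_gpus // (tp * pp * ep)                    # data-parallel degree fills the rest
    shard = tp * pp * ep * (dp if fsdp else 1)         # ways the parameters are split
    params_gb = num_params * bytes_per_param / shard / 1e9
    return dp, params_gb

# 1024 GPUs, 70B parameters: TP=8 inside each node, PP=8 across nodes, FSDP over DP.
dp, gb = layout(1024, 70e9, tp=8, pp=8, fsdp=True)
print(dp, round(gb, 2))   # 16 data-parallel replicas, ~0.14 GB of bf16 weights per GPU
```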

Practical insights from large-scale training configurations

Analysis of recent large-scale training runs reveals common patterns and emerging best practices. For instance, models like LLaMA 3 employ a multi-stage parallelization strategy, using tensor parallelism for dense layers and context parallelism for long-context extension. MoE models like DeepSeek v3 and Mixtral leverage expert parallelism, often combined with pipeline and tensor parallelism, demonstrating the effectiveness of sharding experts. Studies and NVIDIA's parallelism guidelines highlight trade-offs: staying within fast intra-node interconnects for tensor/expert parallelism, using pipeline parallelism for inter-node links, and maximizing data parallelism. The effectiveness of these strategies is underscored by their ability to maintain high GPU utilization even with massive clusters, though it necessitates significant system complexity and careful tuning.

Parallelism Strategies Cheat Sheet

Practical takeaways from this episode

Do This

Use Tensor/Expert Parallel within fast interconnects (single node).
Use Pipeline Parallel for slower, multi-node communication.
Leverage FSDP (ZeRO Stage 3) to shard parameters, gradients, and optimizer states for memory efficiency.
Maximize Data Parallelism by using most GPUs for data sharding.
When the model doesn't fit in memory, use Tensor/Expert Parallelism and/or Pipeline Parallelism/FSDP.
If the batch size is too small, use gradient accumulation for better GPU utilization (see the sketch after this list).
Combine parallelism strategies (3D/4D parallelism) to keep compute units fully utilized.
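
A minimal sketch of the gradient-accumulation item above, using a toy model and data (all values illustrative):

```python
# Gradient accumulation: step the optimizer once per `accum_steps` micro-batches,
# so the effective batch is accum_steps times larger than what is resident at once.
import torch
import torch.nn as nn

model = nn.Linear(128, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
accum_steps = 8

optimizer.zero_grad()
for step in range(64):
    x, y = torch.randn(4, 128), torch.randn(4, 1)              # one small micro-batch
    loss = nn.functional.mse_loss(model(x), y) / accum_steps   # scale so grads sum to a big-batch gradient
    loss.backward()                                            # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                                       # one update per effective batch
        optimizer.zero_grad()
```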

Avoid This

Don't excessively split matrices in Tensor Parallelism if it significantly reduces GPU utilization.
Avoid relying solely on state sharding (ZeRO Stages 1 & 2): these do not shard the parameters, and activation memory remains a major bottleneck.
Do not expect perfect utilization from naive Pipeline Parallelism due to bubbles; requires large batch sizes or advanced techniques (e.g., zero-bubble pipelining).
Don't assume one parallelism strategy dominates; understand the trade-offs.
Avoid using Tensor Parallelism beyond fast interconnects (typically 8 GPUs per node).
Be aware that Expert Parallelism still requires careful systems implementation and optimization.

Comparison of Parallelism Strategies

Data extracted from this episode

Strategy | Primary Goal | Memory Savings | Communication Cost | Key Advantage | Key Disadvantage
Data Parallelism | Scale compute | None (replicates model) | All-reduce of gradients (~2x parameters) | Simple, scales compute | No memory scaling, limited by batch size
ZeRO Stage 1 (optimizer-state sharding) | Memory scaling | Significant (optimizer state) | Same as DDP (equivalent to all-reduce) | Free memory savings | Doesn't scale parameter/gradient memory
ZeRO Stage 2 (gradient + optimizer-state sharding) | Memory scaling | Significant (gradients + optimizer state) | Same as DDP (equivalent to all-reduce) | More memory savings | Still doesn't scale parameter memory
ZeRO Stage 3 / FSDP (full sharding) | Memory scaling | High (parameters, gradients, optimizer state) | Slightly more than DDP (2 all-gathers + reduce-scatter) | Massive memory reduction, potentially overhead-free with overlap | Complex, requires communication overlap
Pipeline Parallelism | Memory scaling | High (splits layers) | Point-to-point (activation size: B x S x H) | Good for the slowest links, memory efficient | Pipeline bubbles (idle time), sensitive to batch size
Tensor Parallelism | Memory scaling | High (splits matrices) | Activation-sized all-reduces per layer | No pipeline bubble, low complexity | High communication overhead, best within a node
Expert Parallelism | Memory scaling | High (splits experts in MoE) | All-to-all dispatches (latency sensitive) | Efficient for MoE, reduces activation memory | Complex implementation, latency sensitive
Sequence/Context Parallelism | Memory scaling | High (splits activations by sequence) | All-gathers / reduce-scatters | Reduces activation memory linearly | Reminiscent of FSDP, needs careful management

Common Questions

What are the primary bottlenecks that force large language models to be trained across many GPUs?

The primary bottlenecks are compute power and memory capacity. Large language models require more computational resources than a single chip can provide, and their parameters often exceed the memory of a single GPU, forcing the workload to be distributed across multiple machines and GPUs.

Topics

Mentioned in this video

Software & Apps
GPU

Graphics Processing Unit, described as lightweight compared to TPUs, with a fat tree networking philosophy and suitability for mixture of experts models.

Virgo network

A higher-level networking stack mentioned in the context of cross-rack connectivity for TPU v8.

NVLink

A high-speed interconnect, recommended for keeping expert and tensor parallelism within a single machine.

Adam

An optimization algorithm that requires tracking first and second moments of gradients, leading to significant memory overhead.

DDP

Distributed Data Parallel, i.e. naive data parallelism; used as the baseline for comparison with the ZeRO stages and characterized by replicating a full model copy on every GPU.

FSDP

Fully Sharded Data Parallelism, a memory-saving technique that shards parameters, gradients, and optimizer states across GPUs, often used in PyTorch.

PyTorch

A deep learning framework mentioned in relation to FSDP implementation and profiling tools.

DeepEP

DeepSeek's library for expert parallelism, focusing on low-level GPU networking primitives for efficient routing and dispatching.

PTX

NVIDIA's low-level GPU instruction set (a virtual assembly language); DeepSeek used undocumented PTX instructions to accelerate networking communication.

Megatron

NVIDIA's parallelism library and associated guidance document, used as a reference for best practices in parallelism strategies.

OLMo

A 7B-parameter model trained on the Dolma dataset using FSDP.

Yi

A model trained in the Chinese open-weights setting, using ZeRO Stage 1, tensor, and pipeline parallelism.

Llama 3

A large dense model for which NVIDIA reported detailed parallelism strategies across different training phases, including long context extension.

Gemma 2

A Google open-weights model that uses FSDP, tensor parallelism, and sequence parallelism, realized on Google's TPU mesh.

Mixtral 8x22B

A large mixture-of-experts model where parallelism configurations were investigated, using expert, pipeline, and tensor parallelism.

NVIDIA Bridge

A repository from NVIDIA that provides recommended training configurations for various model sizes and settings.

Qwen 3

A model that follows the DeepSeek recipe for parallelism, utilizing a large degree of expert, pipeline, and tensor parallelism.
