Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 8: Parallelism
Key Moments
Training modern LLMs requires complex "4D" parallelism, combining data, tensor, pipeline, and expert strategies to overcome compute and memory bottlenecks across vast GPU clusters. This power, however, comes at the cost of intricate systems engineering and careful tuning to the underlying hardware.
Key Insights
To train massive language models, parallelism is essential due to compute and memory bottlenecks, requiring strategies to distribute models across multiple machines and GPUs.
Data parallelism splits the batch across GPUs but does not reduce memory usage, as each GPU must store a full copy of the model, gradients, and optimizer states.
Zero Redundancy Optimizer (ZeRO) stages offer memory savings by sharding optimizer states (Stage 1), gradients (Stage 2), and parameters (Stage 3), with Stage 3 (FSDP) achieving significant memory reduction by fetching parameters on demand.
Model parallelism strategies like pipeline (layer-wise) and tensor parallelism (matrix splitting) further break down the model to fit in memory, trading off increased communication for reduced memory footprints.
Expert parallelism, used for Mixture-of-Experts (MoE) models, shards experts across devices, offering an alternative to tensor parallelism with advantages in routing sparsity and avoiding small matrix multiplications.
Achieving high GPU utilization in large-scale training involves a combination of parallelism strategies (data, tensor, pipeline, expert, sequence), with specific rules of thumb for composing them based on network topology and model architecture, often requiring sophisticated system-level optimizations.
Addressing compute and memory bottlenecks with parallelism
The training of massive language models faces fundamental limitations in compute power and memory capacity, necessitating distributed training across numerous GPUs and machines. The primary bottlenecks are the sheer amount of computation required, which exceeds single-chip capabilities, and the enormous model sizes that cannot fit into the memory of a single GPU. This lecture delves into various parallelization strategies designed to distribute both computation and memory requirements, distinguishing between intra-node parallelism (fast, high-bandwidth connections) and inter-node parallelism (slower, more limited connections). These strategies rely on collective communication primitives like all-reduce, all-gather, and reduce-scatter, which form the algorithmic basis for distributed operations.
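The three collective primitives named above compose in a useful way: an all-reduce can be built from a reduce-scatter followed by an all-gather, which is exactly the decomposition that bandwidth-optimal ring implementations (and ZeRO, later in this summary) exploit. Below is a toy single-process sketch, using NumPy arrays to stand in for per-worker buffers; the function names and the simulation itself are illustrative, not any real library's API.

```python
import numpy as np

def reduce_scatter(shards):
    # Sum the full vectors across workers, then slice: worker i keeps chunk i.
    n = len(shards)
    summed = np.stack(shards).sum(axis=0)
    return np.split(summed, n)

def all_gather(chunks):
    # Every worker reconstructs the full vector from the N reduced slices.
    full = np.concatenate(chunks)
    return [full.copy() for _ in chunks]

def all_reduce(shards):
    # All-reduce decomposes into reduce-scatter + all-gather.
    return all_gather(reduce_scatter(shards))

# 4 simulated workers, each holding a "gradient" vector of length 8.
workers = [np.arange(8, dtype=float) * (i + 1) for i in range(4)]
result = all_reduce(workers)
assert all(np.allclose(r, sum(workers)) for r in result)
```

Every worker ends with the same summed vector, matching what a real NCCL or XLA all-reduce would produce over the network.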
Hardware philosophies and their impact on parallelism
Different hardware architectures present distinct networking philosophies that influence parallelization choices. Google's TPUs often employ a toroidal mesh topology, excelling at neighbor-to-neighbor communication and simplifying scaling. This makes them well-suited for predictable communication patterns in dense models. In contrast, NVIDIA's GPUs typically utilize a fat-tree or all-to-all philosophy, offering more flexibility for unpredictable communication, as seen in models like Mixture-of-Experts (MoE) where tokens might route to different experts. Recent hardware developments, such as Google's TPU v8 and its Virgo network, show a convergence toward more all-to-all connectivity, reflecting the evolving needs of large-scale model training and serving, where flexible bandwidth becomes critical.
Data parallelism and its memory limitations
Data parallelism is the most intuitive strategy, where a large batch is divided across multiple GPUs, with each GPU processing a subset of the data. While this scales compute efficiently, it offers no memory savings because each GPU must still store a complete replica of the model parameters, gradients, and importantly, the optimizer states. The memory required for optimizer states, especially in optimizers like Adam, can significantly exceed the memory needed for parameters and gradients themselves, often requiring multiples of the parameter size (e.g., 5x the parameter memory). This complete replication across GPUs makes naive data parallelism impractical for large models.
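The replication cost is easy to make concrete with a back-of-the-envelope accounting. The byte counts below are a common mixed-precision assumption (bf16 parameters and gradients, Adam state kept in fp32), not figures from the lecture itself:

```python
def training_memory_gb(n_params, param_bytes=2, grad_bytes=2, optim_bytes=12):
    # Assumed mixed-precision layout per parameter: 2 B bf16 weight,
    # 2 B bf16 gradient, and 12 B of Adam state in fp32
    # (4 B master copy + 4 B first moment + 4 B second moment).
    return n_params * (param_bytes + grad_bytes + optim_bytes) / 1e9

# Under naive data parallelism, every replica pays the full cost.
per_gpu = training_memory_gb(70e9)        # e.g. a 70B-parameter model
print(f"{per_gpu:.0f} GB per GPU")        # far beyond any single accelerator
```

Note that the optimizer state alone (12 B) is 6x the bf16 parameter memory under these assumptions, which is why sharding it first (ZeRO Stage 1) is such an easy win.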
ZeRO and memory optimization through sharding
The Zero Redundancy Optimizer (ZeRO) tackles the memory limitations of data parallelism by sharding different components of the training state across GPUs. ZeRO Stage 1 shards only the optimizer states, reducing memory by a factor of N (number of GPUs). Stage 2 further shards the gradients, while Stage 3 (Fully Sharded Data Parallel - FSDP) shards all three: optimizer states, gradients, and model parameters. In FSDP, each GPU only materializes the parameters for the layers it's currently processing, gathering them on demand and freeing them afterward. This significantly reduces per-GPU memory usage, enabling larger models to fit. While Stages 1 and 2 offer communication costs equivalent to naive data parallelism (one all-reduce), FSDP introduces additional communication primitives (reduce-scatter, all-gather) but can achieve near-zero overhead by overlapping communication with computation, making it highly efficient.
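The stage-by-stage savings can be sketched with the same per-parameter byte assumptions as above (bf16 weights and gradients, fp32 Adam state); the function is an illustrative model, not a profiler:

```python
def zero_per_gpu_gb(n_params, n_gpus, stage,
                    param_bytes=2, grad_bytes=2, optim_bytes=12):
    # Each ZeRO stage shards one more component across the n_gpus ranks:
    # stage 1 the optimizer state, stage 2 also gradients, stage 3 also params.
    p = n_params * param_bytes / (n_gpus if stage >= 3 else 1)
    g = n_params * grad_bytes / (n_gpus if stage >= 2 else 1)
    o = n_params * optim_bytes / (n_gpus if stage >= 1 else 1)
    return (p + g + o) / 1e9

# A 7B-parameter model sharded across 8 GPUs:
mem = [zero_per_gpu_gb(7e9, 8, s) for s in range(4)]
for s, m in enumerate(mem):
    print(f"ZeRO stage {s}: {m:6.2f} GB per GPU")
```

Stage 1 already removes most of the footprint because the fp32 optimizer state dominates; stage 3 (FSDP) divides everything by the GPU count, at the cost of the extra all-gathers described above.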
Model parallelism: Pipeline and Tensor parallelism
Model parallelism strategies break down the model itself across GPUs. Pipeline parallelism (or layer-wise parallelism) splits the model's layers, assigning consecutive layers to different GPUs. This reduces activation memory but introduces significant "pipeline bubbles"—idle time where GPUs wait for data from previous stages. These bubbles can be mitigated by micro-batching, effectively increasing the batch size to keep more GPUs busy. Tensor parallelism (or matrix parallelism) splits individual weight matrices within layers (e.g., in linear layers and attention mechanisms) across GPUs. This strategy incurs communication costs on every matrix multiplication but avoids pipeline bubbles and can effectively reduce activation memory by slicing it across the tensor parallel dimension. Tensor parallelism is best suited for high-bandwidth, low-latency interconnects like NVLink within a node, as inter-node communication can severely degrade performance.
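The effect of micro-batching on pipeline bubbles follows a simple closed form: with p stages and m micro-batches, a naive GPipe-style schedule idles for a (p - 1) / (m + p - 1) fraction of each step. A quick sketch (the formula is the standard one for this schedule, assumed rather than quoted from the lecture):

```python
def bubble_fraction(stages, microbatches):
    # Idle fraction of a naive GPipe-style schedule with p stages and
    # m micro-batches: the pipeline takes (m + p - 1) slots to fill and
    # drain, of which (p - 1) are pure bubble.
    return (stages - 1) / (microbatches + stages - 1)

# More micro-batches amortize the fill-and-drain bubble of an 8-stage pipeline.
for m in (1, 4, 16, 64):
    print(f"m={m:3d}: bubble = {bubble_fraction(8, m):.2f}")
```

With a single micro-batch, 7 of every 8 slots are wasted; pushing to 64 micro-batches drives the bubble below 10%, which is why pipeline parallelism effectively demands a large batch.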
Expert parallelism and sequence/context parallelism
Expert parallelism is another model parallelism strategy, particularly relevant for Mixture-of-Experts (MoE) models. It involves sharding the individual 'experts' (often the feed-forward networks within a transformer block) across different devices. This can be more efficient than tensor parallelism for MoE models, as it allows for sparse routing of tokens and avoids the overhead of small matrix multiplications seen in fine-grained tensor parallelism. Sequence or Context Parallelism, often used for very long sequences, splits activations along the sequence dimension. This is conceptually similar to FSDP in that it materializes necessary activations on demand by sharding them across accelerators, typically using all-gather and reduce-scatter operations. It's particularly useful for extending context length in models during training.
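The core data movement in expert parallelism is the token dispatch: each token is routed to an expert, and tokens bound for experts on another device must cross the network in an all-to-all exchange. The sketch below is a toy single-process stand-in for that grouping step; the `dispatch` function and even-sharding scheme are illustrative assumptions, not DeepSeek's or any library's implementation:

```python
import numpy as np

def dispatch(tokens, expert_ids, n_experts, n_devices):
    # Experts are sharded evenly, so expert e lives on device e // per_dev.
    # On real hardware, grouping tokens by destination device is an
    # all-to-all exchange over the network, and is latency sensitive.
    per_dev = n_experts // n_devices
    buckets = [[] for _ in range(n_devices)]
    for tok, e in zip(tokens, expert_ids):
        buckets[e // per_dev].append(tok)
    return buckets

rng = np.random.default_rng(0)
routes = rng.integers(0, 8, size=16)       # router picks 1 of 8 experts
buckets = dispatch(list(range(16)), routes, n_experts=8, n_devices=4)
assert sum(len(b) for b in buckets) == 16  # each token lands on one device
```

Because the router's choices are data-dependent, the bucket sizes are uneven, which is exactly the unpredictable communication pattern that favors all-to-all network topologies over fixed meshes.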
Composing parallelism strategies for optimal utilization
No single parallelism strategy is universally optimal. Large-scale training requires a sophisticated combination of these techniques, often referred to as 3D or 4D parallelism. The general prescription involves using data parallelism to maximize GPU utilization across the largest number of devices. Model parallelism (tensor, expert, pipeline) is then employed to reduce the memory footprint per GPU until the model fits. Tensor and expert parallelism are typically confined to fast intra-node interconnects, while pipeline parallelism is used for inter-node communication. Sequence parallelism further aids in memory reduction for long contexts. Optimizing this composite strategy involves intricate system-level engineering, balancing compute, communication, and memory constraints to achieve high GPU utilization, often drawing on detailed performance analysis and empirical guidelines from large-scale training runs and hardware specifications.
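The arithmetic behind such a composite layout is simple: the model-parallel degrees multiply, and the data-parallel degree is whatever factor of the cluster remains. A minimal sketch, under the simplifying assumption that the degrees factor cleanly and that expert parallelism occupies its own dimension:

```python
def data_parallel_degree(total_gpus, tp, pp, ep=1):
    # Tensor-, pipeline-, and expert-parallel degrees multiply together;
    # the leftover factor of the cluster is the data-parallel degree.
    model_gpus = tp * pp * ep
    assert total_gpus % model_gpus == 0, "degrees must divide the cluster"
    return total_gpus // model_gpus

# 1024 GPUs: tensor parallelism kept inside an 8-GPU NVLink node,
# pipeline parallelism spanning 16 stage groups over the slower network,
# leaving data parallelism to fill out the rest of the cluster.
dp = data_parallel_degree(1024, tp=8, pp=16)
print(dp)
```

This mirrors the rules of thumb above: tp bounded by the NVLink domain, pp stretched across nodes, and dp scaled as far as the global batch size allows.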
Practical insights from large-scale training configurations
Analysis of recent large-scale training runs reveals common patterns and emerging best practices. For instance, models like LLaMA 3 employ a multi-stage parallelization strategy, using tensor parallelism for dense layers and context parallelism for long-context extension. MoE models like DeepSeek v3 and Mixtral leverage expert parallelism, often combined with pipeline and tensor parallelism, demonstrating the effectiveness of sharding experts. Studies and NVIDIA's parallelism guidelines highlight trade-offs: staying within fast intra-node interconnects for tensor/expert parallelism, using pipeline parallelism for inter-node links, and maximizing data parallelism. The effectiveness of these strategies is underscored by their ability to maintain high GPU utilization even with massive clusters, though it necessitates significant system complexity and careful tuning.
Comparison of Parallelism Strategies
| Strategy | Primary Goal | Memory Savings | Communication Cost | Key Advantage | Key Disadvantage |
|---|---|---|---|---|---|
| Data Parallelism | Scale Compute | None (Replicates model) | All-reduce gradients (2x parameters) | Simple, scales compute | No memory scaling, limited by batch size |
| ZeRO Stage 1 (Optimizer State Sharding) | Memory Scaling | Significant (Optimizer state) | Same as DDP (equiv. to All-reduce) | Free memory savings | Doesn't scale parameter/gradient memory |
| ZeRO Stage 2 (Gradient & Optimizer State Sharding) | Memory Scaling | Significant (Gradients + Optimizer state) | Same as DDP (equiv. to All-reduce) | More memory savings | Still doesn't scale parameter memory |
| ZeRO Stage 3 / FSDP (Full Sharding) | Memory Scaling | High (Parameters, Gradients, Optimizer state) | Slightly more than DDP (2 All-gather + Reduce-scatter) | Massive memory reduction, potentially overhead-free with overlap | Complex, requires communication overlap |
| Pipeline Parallelism | Memory Scaling | High (Splits layers) | Point-to-point (Activation size: B*S*H) | Good for slowest links, memory efficient | Pipeline bubbles (idle time), sensitive to batch size |
| Tensor Parallelism | Memory Scaling | High (Splits matrices) | Activation size (All-reduce, activation-level) | No pipeline bubble, low complexity | High communication overhead, best within node |
| Expert Parallelism | Memory Scaling | High (Splits experts in MoE) | All-to-all dispatches (Latency sensitive) | Efficient for MoE, reduces activation memory | Complex implementation, latency sensitive |
| Sequence/Context Parallelism | Memory Scaling | High (Splits activations by sequence) | All-gathers/Reduce-scatters | Reduces activation memory linearly | FSDP-like gather/scatter overhead that needs careful management |
Common Questions
Why must large language models be trained across many GPUs and machines?
The primary bottlenecks are compute power and memory capacity. Large language models require more computational resources than a single chip can provide, and their parameters often exceed the memory of a single GPU, forcing the workload to be distributed across multiple machines and GPUs.
Mentioned in this video
Developer of TPUs, including the announced TPU8; also associated with the Gemini 2 model.
Manufacturer of GPUs and Megatron parallelism library. Mentioned for their H200 and other systems.
A company known for its systems engineering strength in AI; developed the DPP library for expert parallelism and trained models like DeepSeek v1 and v3.
Google's accelerator, characterized by its toroidal mesh networking, suitable for dense models and predictable partitions.
A TPU model with a tree topology and an all-to-all connection philosophy, suitable for modern language models and training.
A Chinese AI chip with a large number of interconnected chips, offering high scalability at the cost of high power consumption.
A GPU model mentioned as a point of comparison for the Huawei Ascend 910's performance.
Graphics Processing Unit, described as lightweight compared to TPUs, with a fat tree networking philosophy and suitability for mixture of experts models.
A higher-level networking stack mentioned in the context of cross-rack connectivity for TPU8.
A high-speed interconnect, recommended for keeping expert and tensor parallelism within a single machine.
An optimization algorithm that requires tracking first and second moments of gradients, leading to significant memory overhead.
Naive Data Parallelism, used as a baseline for comparison with ZeRO stages, characterized by replicating model copies across GPUs.
Fully Sharded Data Parallelism, a memory-saving technique that shards parameters, gradients, and optimizer states across GPUs, often used in PyTorch.
A deep learning framework mentioned in relation to FSDP implementation and profiling tools.
DeepSeek's library for expert parallelism, focusing on low-level GPU networking primitives for efficient routing and dispatching.
NVIDIA's low-level GPU instruction representation; DeepSeek used undocumented PTX instructions to accelerate networking communication.
NVIDIA's parallelism library and associated guidance document, used as a reference for best practices in parallelism strategies.
A 7B-parameter model trained on the Dolma dataset using FSDP.
A model trained in the Chinese open-weights setting, using Zero Stage 1, tensor, and pipeline parallelism.
A large dense model for which NVIDIA reported detailed parallelism strategies across different training phases, including long context extension.
A Google open-source model that uses FSDP, tensor parallel, and sequence parallel, realized on Google's TPU mesh.
A large mixture-of-experts model where parallelism configurations were investigated, using expert, pipeline, and tensor parallelism.
A repository from NVIDIA that provides recommended training configurations for various model sizes and settings.
A model that follows the DeepSeek recipe for parallelism, utilizing a large amount of expert parallel, pipeline parallel, and tensor parallel.