
Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 7: Parallelism

Stanford Online | Apr 28, 2026
TL;DR

Training massive language models requires techniques like data, tensor, and pipeline parallelism to overcome GPU memory and speed limitations, but each introduces communication overhead.

Key Insights

1. Multi-GPU training is necessary when model parameters, activations, or gradients exceed a single GPU's HBM (a trillion-parameter model far exceeds even the B200's 192GB) or when training must be accelerated.

2. Collective operations like `all-reduce`, `reduce-scatter`, and `all-gather` are fundamental primitives for distributed communication, originating from 1980s parallel programming.

3. Data parallelism (DDP) splits data across GPUs, synchronizing gradients via `all-reduce`; it's elegant but requires holding all model parameters on each GPU.

4. Tensor parallelism splits model layers across GPUs, requiring frequent communication of activations, and is best suited for high-bandwidth interconnects like NVLink.

5. Pipeline parallelism splits layers sequentially across GPUs, reducing inter-GPU communication but introducing pipeline bubbles (idle time) that micro-batching and communication-computation overlap aim to mitigate.

6. Interconnect speeds follow a hierarchy: on-chip memory > HBM > NVLink > InfiniBand > Ethernet, with RDMA enabling direct GPU-to-GPU communication that bypasses the CPU on the faster links.

The necessity of multi-GPU parallelism

Training increasingly large language models necessitates moving beyond single-GPU capabilities. The primary drivers for employing multiple GPUs fall into two categories: memory constraints and speed requirements. For instance, a trillion-parameter model, even with a B200 GPU offering 192GB of High Bandwidth Memory (HBM), will not fit entirely on a single device due to its parameters, activations, and optimizer states. Furthermore, even if a model could theoretically fit, distributing it across multiple GPUs can significantly reduce training time. This introduces a trade-off: spreading computation allows for faster training but incurs communication overheads, requiring careful orchestration to avoid bottlenecks.
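
To make the memory constraint concrete, here is a rough back-of-the-envelope estimate in Python. The per-parameter byte counts (bf16 weights and gradients, fp32 master weights and Adam moments) are assumptions about a typical mixed-precision setup, not figures from the lecture:

```python
import math

# Rough memory estimate for a 1-trillion-parameter model trained with Adam in
# mixed precision. Byte counts per parameter are assumptions; the exact
# breakdown depends on the optimizer and precision choices.
params = 1e12
bytes_per_param = (
    2      # bf16 parameter copy
    + 2    # bf16 gradient
    + 4    # fp32 master weights
    + 4    # Adam first moment (fp32)
    + 4    # Adam second moment (fp32)
)
total_gb = params * bytes_per_param / 1e9
print(f"~{total_gb:,.0f} GB of model state before any activations")  # ~16,000 GB

b200_hbm_gb = 192
print(f"at least {math.ceil(total_gb / b200_hbm_gb)} B200s just to hold it")  # 84
```

Even ignoring activations, the model state alone is two orders of magnitude larger than a single B200's HBM, which is why sharding across many GPUs is unavoidable.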

Core building blocks: Collective communication operations

At the heart of distributed training lie collective operations, a set of primitives established in the 1980s for parallel programming. These operations define general communication patterns across multiple devices (ranks, typically GPUs). Key operations include: `broadcast` (one rank sends data to all others), `scatter` (one rank splits data and sends parts to others), `gather` (multiple ranks send data to one), `reduce` (multiple ranks send data to one rank, which combines it with an operation like sum), `all-gather` (every rank's shard is gathered and replicated on all ranks), `reduce-scatter` (data is reduced across ranks and each rank keeps one shard of the result), `all-reduce` (reduces data across all ranks and replicates the result on all), and `all-to-all` (each rank sends specific data to each other rank, crucial for architectures like Mixture-of-Experts). While `broadcast`, `scatter`, and `gather` serve as conceptual warm-ups, `all-gather`, `reduce-scatter`, and `all-reduce` are the workhorses of distributed training.
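
A minimal sketch of the three workhorse collectives via `torch.distributed`, assuming a process group has already been initialized (e.g., by `torchrun`) with the NCCL backend so tensors live on this rank's GPU; the shapes are purely illustrative:

```python
import torch
import torch.distributed as dist

def demo_collectives(device: torch.device):
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # all-reduce: sum the tensor across ranks; every rank ends with the full sum.
    x = torch.full((4,), float(rank), device=device)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    # all-gather: every rank contributes a shard and receives all shards.
    shard = torch.full((2,), float(rank), device=device)
    shards = [torch.empty(2, device=device) for _ in range(world_size)]
    dist.all_gather(shards, shard)

    # reduce-scatter: sum across ranks, then each rank keeps one shard of the sum.
    inputs = [torch.ones(2, device=device) for _ in range(world_size)]
    out = torch.empty(2, device=device)
    dist.reduce_scatter(out, inputs, op=dist.ReduceOp.SUM)
```

Note that `reduce_scatter` is only supported on GPU backends such as NCCL, which is one reason these collectives are so closely tied to the hardware discussed later.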

Data parallelism: Splitting the data, synchronizing gradients

Data parallelism (DP) is a straightforward approach where the training data is partitioned across multiple GPUs. Each GPU receives a subset of the batch and performs a forward and backward pass using a full replica of the model. The crucial step is synchronizing the gradients computed on each GPU. This is done with the `all-reduce` collective, which sums the gradients from all GPUs and leaves the result on every rank; dividing by the number of ranks then gives each replica the same averaged gradient, so all model replicas are updated identically. The method is elegant, requiring minimal model code modification, typically just a one-line change after the backward pass. However, it requires that each GPU hold the entire model's parameters, optimizer states, and gradients in memory. Data parallelism is particularly effective when the batch size is sufficiently large (ideally a multiple of the world size) and the model fits within a single GPU's memory.
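
A hand-rolled sketch of that training step, in the spirit of the lecture's from-scratch DDP (the `ddp_step` helper and its arguments are illustrative, and the process group is assumed to be initialized already, e.g., via `torchrun`):

```python
import torch
import torch.distributed as dist
import torch.nn as nn

def ddp_step(model: nn.Module, optimizer, local_batch, targets, loss_fn):
    world_size = dist.get_world_size()

    loss = loss_fn(model(local_batch), targets)   # forward on this rank's shard
    loss.backward()                               # backward as usual

    # The "one-line change": sum gradients across ranks, then divide to average.
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

    optimizer.step()                              # identical update on every rank
    optimizer.zero_grad()
```

Production implementations such as PyTorch's `DistributedDataParallel` bucket these all-reduces and overlap them with the backward pass, but the underlying communication is the same.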

Tensor parallelism: Splitting weights within layers

Tensor parallelism (TP) takes a different approach by splitting the model's layers themselves across GPUs. Instead of replicating the entire model, each GPU holds only a portion of the parameters for each layer. This commonly involves splitting the weight matrices column-wise (or row-wise). During the forward pass, each GPU computes its part of the matrix multiplication. To combine these partial results, communication is required. Specifically, an `all-gather` operation is often used on the activations to reconstruct the full activation tensor before passing it to the next layer or operation. In the backward pass, a `reduce-scatter` operation is typically employed to distribute the computed gradients. TP introduces significant communication, making it most effective within a single node where high-bandwidth interconnects like NVLink are available, allowing for rapid exchange of large activation tensors.
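
A forward-only sketch of a column-parallel linear layer under those assumptions (hypothetical class and sizes; it assumes an initialized NCCL process group and that `d_out` divides evenly by the world size; the backward-pass communication, e.g., the reduce-scatter of gradients, needs its own autograd handling and is omitted here):

```python
import torch
import torch.distributed as dist
import torch.nn as nn

class ColumnParallelLinear(nn.Module):
    """Each rank stores one column slice of the weight and computes a partial
    activation; the full activation is reassembled with all-gather."""

    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        world_size = dist.get_world_size()
        assert d_out % world_size == 0
        # Only 1/world_size of the output columns live on this rank.
        self.local = nn.Linear(d_in, d_out // world_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        partial = self.local(x)                          # [batch, d_out / world_size]
        shards = [torch.empty_like(partial) for _ in range(dist.get_world_size())]
        dist.all_gather(shards, partial)                 # collect every rank's slice
        return torch.cat(shards, dim=-1)                 # [batch, d_out]
```

Because the full activation tensor crosses the interconnect at every such layer, this pattern is only attractive when the links are as fast as NVLink.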

Pipeline parallelism: Staging layers sequentially

Pipeline parallelism (PP) divides the model by assigning sequential layers to different GPUs. Each GPU processes a specific stage of the network. To mitigate the inefficiency caused by GPUs sitting idle while waiting for data from the previous stage (known as pipeline bubbles), the input batch is further divided into smaller 'micro-batches'. This allows GPUs to process micro-batches concurrently, enabling a pipeline effect where multiple micro-batches are in flight across different stages. While PP can tolerate slower interconnects compared to TP, effectively managing pipeline bubbles and overlapping communication with computation are critical for achieving good performance. This often involves point-to-point send/receive operations between stages and careful scheduling to maximize GPU utilization.
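
A naive, forward-only sketch of one pipeline stage streaming micro-batches with point-to-point `send`/`recv` (the `stage_module`, activation shape, and rank layout are assumptions; real schedules such as 1F1B also interleave backward passes to shrink the bubble further):

```python
import torch
import torch.distributed as dist

def pipeline_forward(stage_module, rank, world_size, micro_batches, act_shape):
    outputs = []
    for mb in micro_batches:
        if rank == 0:
            x = mb                                 # first stage reads real data
        else:
            x = torch.empty(act_shape)
            dist.recv(x, src=rank - 1)             # wait for the previous stage
        y = stage_module(x)                        # this stage's slice of layers
        if rank < world_size - 1:
            dist.send(y, dst=rank + 1)             # hand off to the next stage
        else:
            outputs.append(y)                      # last stage keeps the results
    return outputs
```

With many micro-batches in flight, every stage has work most of the time; with a single large batch, all but one stage would sit idle at any given moment.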

Hardware interconnects and communication optimization

The performance of distributed training is heavily influenced by the hardware interconnects between GPUs. The hierarchy of speeds ranges from on-chip caches (fastest) to HBM, NVLink (within a node, high bandwidth), InfiniBand (between nodes, supporting RDMA), and Ethernet (slowest, often requiring CPU involvement unless RDMA over Converged Ethernet, RoCE, is used). RDMA (Remote Direct Memory Access) allows GPUs to access memory on other GPUs directly, without CPU intervention, significantly reducing latency. NVLink switches and InfiniBand fabrics are standard in high-performance clusters, and rack-scale designs like NVIDIA's NVL72 increase the number of tightly coupled GPUs within a single NVLink domain (72 GPUs). The choice of parallelism strategy is often dictated by this hardware: TP benefits from NVLink, while PP can tolerate slower links, even across geographically dispersed nodes.

Bridging parallelism and implementation in PyTorch

PyTorch's `torch.distributed` library provides a high-level interface to these collective operations, abstracting away low-level details handled by libraries like NCCL (NVIDIA Collective Communications Library) for GPUs or Gloo for CPUs. While the lecture focuses on building from scratch to understand mechanisms, practical implementations often leverage these libraries. The lecture demonstrates basic collectives and then applies DDP to an MLP, highlighting the simple insertion of `all_reduce` for gradient averaging. Tensor and pipeline parallelism are also illustrated with an MLP, emphasizing the need for manual handling of communication primitives like `all-gather` and `reduce-scatter` in these schemes. The course emphasizes understanding these fundamental techniques before moving to more complex frameworks or compiler-driven parallelism.
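
For completeness, the setup boilerplate that all of the sketches above rely on, assuming one process per GPU launched with `torchrun` (a minimal sketch; the environment variables are supplied by the launcher):

```python
import os
import torch
import torch.distributed as dist

def setup():
    # NCCL for GPU runs, Gloo as a CPU-only fallback.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)  # reads RANK, WORLD_SIZE, MASTER_ADDR
    if backend == "nccl":
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

def cleanup():
    dist.destroy_process_group()

# Launch example: torchrun --nproc_per_node=8 train.py
```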

Mastering Parallelism in Large Model Training

Practical takeaways from this episode

Do This

Understand the trade-offs between data, tensor, and pipeline parallelism based on hardware interconnects.
Leverage collective operations (All Reduce, All Gather, Reduce Scatter) for efficient inter-GPU communication.
Optimize for communication bandwidth and latency by choosing appropriate hardware and communication strategies.
Consider micro-batches in pipeline parallelism to mitigate pipeline bubbles and improve GPU utilization.
Overlap computation and communication where possible to maximize throughput.

Avoid This

Do not attempt tensor parallelism over slow interconnects (e.g., standard Ethernet).
Avoid relying solely on point-to-point communication; use collective operations for efficiency.
Do not ignore hardware topology when designing your parallelism strategy.
Avoid feeding the whole batch through a pipeline at once; failing to split it into micro-batches creates large pipeline bubbles.
Do not forget synchronization barriers when managing asynchronous operations across multiple processes.

Common Questions

Why can't very large language models simply be trained on a single GPU?

The main challenge is fitting the model's parameters, activations, and optimizer states into the memory of a single GPU. Training very large models also necessitates distributing the computation across multiple GPUs to achieve reasonable training times.

Topics

Mentioned in this video

Concepts
Tensor Parallelism

A distributed training technique where model layers are split across devices, requiring more communication but enabling larger models.

Scatter

A collective communication operation where a tensor at one rank is split and distributed to other ranks.

Data Parallelism

A distributed training technique where the model's parameters are replicated across multiple devices, and the data is sharded among them.

All Reduce

A collective communication operation where data from all ranks is reduced (e.g., summed) and the result is distributed to all ranks.

Reduce Scatter

A collective communication operation that performs a reduction on tensors across ranks and then scatters the results to different ranks.

Pipeline Parallelism

A distributed training technique where different stages (layers) of the model are placed on different devices, processed sequentially with micro-batches.

Distributed Data Parallel

A PyTorch implementation of data parallelism where gradients are averaged across all processes after the backward pass.

All to All

A collective communication operation where each rank sends a specific message to every other rank, useful for dynamic routing and expert parallelism.

ZeRO

Zero Redundancy Optimizer, a memory optimization technique for distributed training that partitions optimizer states and gradients across data-parallel ranks instead of replicating them.

All Gather

A collective communication operation where each rank sends its data to all other ranks, resulting in each rank holding the complete gathered data.

Broadcast

A collective communication operation where one rank sends its data to all other ranks.

Gather

A collective communication operation where pieces of data from multiple ranks are concatenated and sent to a specific rank.

Reduce

A collective communication operation that applies an associative and commutative operation (like sum, max, min) to data across ranks, with the result on one rank.
