
Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 7: Parallelism

Stanford Online | Apr 28, 2026
TL;DR

Training massive language models requires techniques like data, tensor, and pipeline parallelism to overcome GPU memory and speed limitations, but each introduces communication overhead.

Key Insights

1. Multi-GPU training is necessary when model parameters, activations, or gradients exceed a single GPU's HBM (a trillion-parameter model far exceeds even the B200's 192GB) or when training must be accelerated.

2. Collective operations like `all-reduce`, `reduce-scatter`, and `all-gather` are fundamental primitives for distributed communication, originating from 1980s parallel programming.

3. Data parallelism (DDP) splits data across GPUs, synchronizing gradients via `all-reduce`; it's elegant but requires holding all model parameters on each GPU.

4. Tensor parallelism splits model layers across GPUs, requiring frequent communication of activations, and is best suited for high-bandwidth interconnects like NVLink.

5. Pipeline parallelism splits layers sequentially across GPUs, reducing inter-GPU communication but introducing pipeline bubbles (idle time) that micro-batching and communication-computation overlap aim to mitigate.

6. Interconnect speeds follow a hierarchy: on-chip memory > HBM > NVLink > InfiniBand > Ethernet, with RDMA enabling direct GPU-to-GPU communication that bypasses the CPU on the faster links.

The necessity of multi-GPU parallelism

Training increasingly large language models necessitates moving beyond single-GPU capabilities. The primary drivers for employing multiple GPUs fall into two categories: memory constraints and speed requirements. For instance, a trillion-parameter model, even with a B200 GPU offering 192GB of High Bandwidth Memory (HBM), will not fit entirely on a single device due to its parameters, activations, and optimizer states. Furthermore, even if a model could theoretically fit, distributing it across multiple GPUs can significantly reduce training time. This introduces a trade-off: spreading computation allows for faster training but incurs communication overheads, requiring careful orchestration to avoid bottlenecks.
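
To make the memory constraint concrete, here is a rough back-of-the-envelope estimate in Python. The per-parameter byte counts (bf16 weights and gradients, fp32 master weights and Adam moments) are assumptions about a typical mixed-precision setup, not figures from the lecture:

```python
import math

# Rough memory estimate for a 1-trillion-parameter model trained with Adam in
# mixed precision. Byte counts per parameter are assumptions; the exact
# breakdown depends on the optimizer and precision choices.
params = 1e12
bytes_per_param = (
    2      # bf16 parameter copy
    + 2    # bf16 gradient
    + 4    # fp32 master weights
    + 4    # Adam first moment (fp32)
    + 4    # Adam second moment (fp32)
)
total_gb = params * bytes_per_param / 1e9
print(f"~{total_gb:,.0f} GB of model state before any activations")  # ~16,000 GB

b200_hbm_gb = 192
print(f"at least {math.ceil(total_gb / b200_hbm_gb)} B200s just to hold it")  # 84
```

Even ignoring activations, the model state alone is two orders of magnitude larger than a single B200's HBM, which is why sharding across many GPUs is unavoidable.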

Core building blocks: Collective communication operations

At the heart of distributed training lie collective operations, a set of primitives established in the 1980s for parallel programming. These operations define general communication patterns across multiple devices (ranks, typically GPUs). Key operations include: `broadcast` (one rank sends data to all others), `scatter` (one rank splits data and sends parts to others), `gather` (multiple ranks send data to one), `reduce` (multiple ranks send data to one rank, which combines it with an operation like sum), `all-gather` (every rank's shard is gathered and replicated on all ranks), `reduce-scatter` (data is reduced across ranks and each rank keeps one shard of the result), `all-reduce` (reduces data across all ranks and replicates the result on all), and `all-to-all` (each rank sends specific data to each other rank, crucial for architectures like Mixture-of-Experts). While `broadcast`, `scatter`, and `gather` serve as conceptual warm-ups, `all-gather`, `reduce-scatter`, and `all-reduce` are the workhorses of distributed training.
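
A minimal sketch of the three workhorse collectives via `torch.distributed`, assuming a process group has already been initialized (e.g., by `torchrun`) with the NCCL backend so tensors live on this rank's GPU; the shapes are purely illustrative:

```python
import torch
import torch.distributed as dist

def demo_collectives(device: torch.device):
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # all-reduce: sum the tensor across ranks; every rank ends with the full sum.
    x = torch.full((4,), float(rank), device=device)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    # all-gather: every rank contributes a shard and receives all shards.
    shard = torch.full((2,), float(rank), device=device)
    shards = [torch.empty(2, device=device) for _ in range(world_size)]
    dist.all_gather(shards, shard)

    # reduce-scatter: sum across ranks, then each rank keeps one shard of the sum.
    inputs = [torch.ones(2, device=device) for _ in range(world_size)]
    out = torch.empty(2, device=device)
    dist.reduce_scatter(out, inputs, op=dist.ReduceOp.SUM)
```

Note that `reduce_scatter` is only supported on GPU backends such as NCCL, which is one reason these collectives are so closely tied to the hardware discussed later.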

Data parallelism: Splitting the data, synchronizing gradients

Data parallelism (DP) is a straightforward approach where the training data is partitioned across multiple GPUs. Each GPU receives a subset of the batch and performs a forward and backward pass using a full replica of the model. The crucial step is synchronizing the gradients computed on each GPU. This is done with the `all-reduce` collective, which sums the gradients from all GPUs and leaves the result on every rank; dividing by the number of ranks then gives each replica the same averaged gradient, so all model replicas are updated identically. The method is elegant, requiring minimal model code modification, typically just a one-line change after the backward pass. However, it requires that each GPU hold the entire model's parameters, optimizer states, and gradients in memory. Data parallelism is particularly effective when the batch size is sufficiently large (ideally a multiple of the world size) and the model fits within a single GPU's memory.
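
A hand-rolled sketch of that training step, in the spirit of the lecture's from-scratch DDP (the `ddp_step` helper and its arguments are illustrative, and the process group is assumed to be initialized already, e.g., via `torchrun`):

```python
import torch
import torch.distributed as dist
import torch.nn as nn

def ddp_step(model: nn.Module, optimizer, local_batch, targets, loss_fn):
    world_size = dist.get_world_size()

    loss = loss_fn(model(local_batch), targets)   # forward on this rank's shard
    loss.backward()                               # backward as usual

    # The "one-line change": sum gradients across ranks, then divide to average.
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

    optimizer.step()                              # identical update on every rank
    optimizer.zero_grad()
```

Production implementations such as PyTorch's `DistributedDataParallel` bucket these all-reduces and overlap them with the backward pass, but the underlying communication is the same.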

Tensor parallelism: Splitting weights within layers

Tensor parallelism (TP) takes a different approach by splitting the model's layers themselves across GPUs. Instead of replicating the entire model, each GPU holds only a portion of the parameters for each layer. This commonly involves splitting the weight matrices column-wise (or row-wise). During the forward pass, each GPU computes its part of the matrix multiplication. To combine these partial results, communication is required. Specifically, an `all-gather` operation is often used on the activations to reconstruct the full activation tensor before passing it to the next layer or operation. In the backward pass, a `reduce-scatter` operation is typically employed to distribute the computed gradients. TP introduces significant communication, making it most effective within a single node where high-bandwidth interconnects like NVLink are available, allowing for rapid exchange of large activation tensors.
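
A forward-only sketch of a column-parallel linear layer under those assumptions (hypothetical class and sizes; it assumes an initialized NCCL process group and that `d_out` divides evenly by the world size; the backward-pass communication, e.g., the reduce-scatter of gradients, needs its own autograd handling and is omitted here):

```python
import torch
import torch.distributed as dist
import torch.nn as nn

class ColumnParallelLinear(nn.Module):
    """Each rank stores one column slice of the weight and computes a partial
    activation; the full activation is reassembled with all-gather."""

    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        world_size = dist.get_world_size()
        assert d_out % world_size == 0
        # Only 1/world_size of the output columns live on this rank.
        self.local = nn.Linear(d_in, d_out // world_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        partial = self.local(x)                          # [batch, d_out / world_size]
        shards = [torch.empty_like(partial) for _ in range(dist.get_world_size())]
        dist.all_gather(shards, partial)                 # collect every rank's slice
        return torch.cat(shards, dim=-1)                 # [batch, d_out]
```

Because the full activation tensor crosses the interconnect at every such layer, this pattern is only attractive when the links are as fast as NVLink.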

Pipeline parallelism: Staging layers sequentially

Pipeline parallelism (PP) divides the model by assigning sequential layers to different GPUs. Each GPU processes a specific stage of the network. To mitigate the inefficiency caused by GPUs sitting idle while waiting for data from the previous stage (known as pipeline bubbles), the input batch is further divided into smaller 'micro-batches'. This allows GPUs to process micro-batches concurrently, enabling a pipeline effect where multiple micro-batches are in flight across different stages. While PP can tolerate slower interconnects compared to TP, effectively managing pipeline bubbles and overlapping communication with computation are critical for achieving good performance. This often involves point-to-point send/receive operations between stages and careful scheduling to maximize GPU utilization.
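
A naive, forward-only sketch of one pipeline stage streaming micro-batches with point-to-point `send`/`recv` (the `stage_module`, activation shape, and rank layout are assumptions; real schedules such as 1F1B also interleave backward passes to shrink the bubble further):

```python
import torch
import torch.distributed as dist

def pipeline_forward(stage_module, rank, world_size, micro_batches, act_shape):
    outputs = []
    for mb in micro_batches:
        if rank == 0:
            x = mb                                 # first stage reads real data
        else:
            x = torch.empty(act_shape)
            dist.recv(x, src=rank - 1)             # wait for the previous stage
        y = stage_module(x)                        # this stage's slice of layers
        if rank < world_size - 1:
            dist.send(y, dst=rank + 1)             # hand off to the next stage
        else:
            outputs.append(y)                      # last stage keeps the results
    return outputs
```

With many micro-batches in flight, every stage has work most of the time; with a single large batch, all but one stage would sit idle at any given moment.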

Hardware interconnects and communication optimization

The performance of distributed training is heavily influenced by the hardware interconnects between GPUs. The hierarchy of speeds ranges from on-chip caches (fastest) to HBM, NVLink (within a node, high bandwidth), InfiniBand (between nodes, supporting RDMA), and Ethernet (slowest, often requiring CPU involvement unless RDMA over Converged Ethernet, RoCE, is used). RDMA (Remote Direct Memory Access) allows GPUs to access memory on other GPUs directly, without CPU intervention, significantly reducing latency. NVLink switches and InfiniBand fabrics are standard in high-performance clusters, and rack-scale designs like NVIDIA's NVL72 increase the number of tightly coupled GPUs within a single NVLink domain (72 GPUs). The choice of parallelism strategy is often dictated by this hardware: TP benefits from NVLink, while PP can tolerate slower links, even across geographically dispersed nodes.

Bridging parallelism and implementation in PyTorch

PyTorch's `torch.distributed` library provides a high-level interface to these collective operations, abstracting away low-level details handled by libraries like NCCL (NVIDIA Collective Communications Library) for GPUs or Gloo for CPUs. While the lecture focuses on building from scratch to understand mechanisms, practical implementations often leverage these libraries. The lecture demonstrates basic collectives and then applies DDP to an MLP, highlighting the simple insertion of `all_reduce` for gradient averaging. Tensor and pipeline parallelism are also illustrated with an MLP, emphasizing the need for manual handling of communication primitives like `all-gather` and `reduce-scatter` in these schemes. The course emphasizes understanding these fundamental techniques before moving to more complex frameworks or compiler-driven parallelism.
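
For completeness, the setup boilerplate that all of the sketches above rely on, assuming one process per GPU launched with `torchrun` (a minimal sketch; the environment variables are supplied by the launcher):

```python
import os
import torch
import torch.distributed as dist

def setup():
    # NCCL for GPU runs, Gloo as a CPU-only fallback.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)  # reads RANK, WORLD_SIZE, MASTER_ADDR
    if backend == "nccl":
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

def cleanup():
    dist.destroy_process_group()

# Launch example: torchrun --nproc_per_node=8 train.py
```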

Mastering Parallelism in Large Model Training

Practical takeaways from this episode

Do This

Understand the trade-offs between data, tensor, and pipeline parallelism based on hardware interconnects.
Leverage collective operations (All Reduce, All Gather, Reduce Scatter) for efficient inter-GPU communication.
Optimize for communication bandwidth and latency by choosing appropriate hardware and communication strategies.
Consider micro-batches in pipeline parallelism to mitigate pipeline bubbles and improve GPU utilization.
Overlap computation and communication where possible to maximize throughput.

Avoid This

Do not attempt tensor parallelism over slow interconnects (e.g., standard Ethernet).
Avoid relying solely on point-to-point communication; use collective operations for efficiency.
Do not ignore hardware topology when designing your parallelism strategy.
Avoid feeding the whole batch through a pipeline at once; failing to split it into micro-batches creates large pipeline bubbles.
Do not forget synchronization barriers when managing asynchronous operations across multiple processes.

Common Questions

Why can't very large language models simply be trained on a single GPU?

The main challenge is fitting the model's parameters, activations, and optimizer states into the memory of a single GPU. Training very large models also necessitates distributing the computation across multiple GPUs to achieve reasonable training times.

Topics

Mentioned in this video

Concepts
Tensor Parallelism

A distributed training technique where model layers are split across devices, requiring more communication but enabling larger models.

Scatter

A collective communication operation where a tensor at one rank is split and distributed to other ranks.

Data Parallelism

A distributed training technique where the model's parameters are replicated across multiple devices, and the data is sharded among them.

All Reduce

A collective communication operation where data from all ranks is reduced (e.g., summed) and the result is distributed to all ranks.

Reduce Scatter

A collective communication operation that performs a reduction on tensors across ranks and then scatters the results to different ranks.

Pipeline Parallelism

A distributed training technique where different stages (layers) of the model are placed on different devices, processed sequentially with micro-batches.

Distributed Data Parallel

A PyTorch implementation of data parallelism where gradients are averaged across all processes after the backward pass.

All to All

A collective communication operation where each rank sends a specific message to every other rank, useful for dynamic routing and expert parallelism.

ZeRO

Zero Redundancy Optimizer, a memory optimization technique for distributed training that partitions optimizer states and gradients across data-parallel ranks instead of replicating them.

All Gather

A collective communication operation where each rank sends its data to all other ranks, resulting in each rank holding the complete gathered data.

Broadcast

A collective communication operation where one rank sends its data to all other ranks.

Gather

A collective communication operation where pieces of data from multiple ranks are concatenated and sent to a specific rank.

Reduce

A collective communication operation that applies an associative and commutative operation (like sum, max, min) to data across ranks, with the result on one rank.
