Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 7: Parallelism
Key Moments
Training massive language models requires techniques like data, tensor, and pipeline parallelism to overcome GPU memory and speed limitations, but each introduces communication overhead.
Key Insights
Multi-GPU training is necessary when model parameters, activations, or gradients exceed a single GPU's HBM (e.g., B200's 192GB for a trillion-parameter model) or to accelerate training.
Collective operations like `all-reduce`, `reduce-scatter`, and `all-gather` are fundamental primitives for distributed communication, originating from 1980s parallel programming.
Data parallelism (DDP) splits data across GPUs, synchronizing gradients via `all-reduce`; it's elegant but requires holding all model parameters on each GPU.
Tensor parallelism splits model layers across GPUs, requiring frequent communication of activations, and is best suited for high-bandwidth interconnects like NVLink.
Pipeline parallelism splits layers sequentially across GPUs, reducing inter-GPU communication but introducing pipeline bubbles (idle time) that micro-batching and communication-computation overlap aim to mitigate.
Interconnect speeds follow a hierarchy: on-chip memory > HBM > NVLink > InfiniBand > Ethernet, with RDMA enabling direct GPU-to-GPU communication bypassing the CPU for faster links.
The necessity of multi-GPU parallelism
Training increasingly large language models necessitates moving beyond single-GPU capabilities. The drivers fall into two categories: memory constraints and speed. A trillion-parameter model cannot fit on a single device: its weights alone occupy about 2 TB in bf16, an order of magnitude more than the 192 GB of High Bandwidth Memory (HBM) on a B200, before counting activations, gradients, and optimizer states. And even when a model does fit, distributing it across many GPUs can cut training time dramatically. This introduces the central trade-off of the lecture: spreading computation buys speed but incurs communication overhead, requiring careful orchestration to avoid bottlenecks.
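As a rough illustration, the memory argument reduces to arithmetic. The sketch below (the helper name is ours) assumes the common mixed-precision Adam accounting of roughly 16 bytes per parameter and ignores activation memory, so it gives a lower bound on the GPU count:

```python
import math

def min_gpus_for_model(n_params: float, hbm_bytes: float,
                       bytes_per_param: float = 16.0) -> int:
    """16 bytes/param assumes mixed-precision Adam: bf16 weights (2) +
    bf16 grads (2) + fp32 master weights, momentum, variance (4+4+4)."""
    return math.ceil(n_params * bytes_per_param / hbm_bytes)

# A 1T-parameter model against a B200's 192 GB of HBM:
print(min_gpus_for_model(1e12, 192e9))  # → 84
```

Even before activations, holding the training state of a trillion-parameter model needs dozens of B200s.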
Core building blocks: Collective communication operations
At the heart of distributed training lie collective operations, a set of primitives established in the 1980s for parallel programming. These operations define general communication patterns across multiple devices (ranks, typically GPUs):
● `broadcast` — one rank sends the same data to all others.
● `scatter` — one rank splits its data and sends one part to each rank.
● `gather` — every rank sends its data to one rank, which concatenates the pieces.
● `reduce` — every rank sends data to one rank, which combines them with an operation such as sum.
● `all-gather` — every rank ends up with the concatenation of all ranks' data.
● `reduce-scatter` — data is reduced across ranks, and each rank keeps one slice of the result.
● `all-reduce` — data is reduced across ranks and the full result is replicated on every rank.
● `all-to-all` — each rank sends a distinct piece of data to every other rank, crucial for architectures like Mixture-of-Experts.
While `broadcast`, `scatter`, and `gather` serve as conceptual warm-ups, `all-gather`, `reduce-scatter`, and `all-reduce` are the workhorses of distributed training.
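The semantics of the three workhorses can be pinned down with a toy, single-process simulation (plain Python, no real communication; the function names are ours):

```python
# Each rank's data is one inner list; the return value is what every
# rank holds after the collective completes.

def all_gather(shards):
    # Every rank ends up with the concatenation of all shards.
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]

def reduce_scatter(per_rank_vectors):
    # Element-wise sum across ranks, then each rank keeps one slice.
    n = len(per_rank_vectors)
    summed = [sum(col) for col in zip(*per_rank_vectors)]
    chunk = len(summed) // n
    return [summed[r * chunk:(r + 1) * chunk] for r in range(n)]

def all_reduce(per_rank_vectors):
    # Equivalent to a reduce-scatter followed by an all-gather.
    return all_gather(reduce_scatter(per_rank_vectors))

ranks = [[1, 2, 3, 4], [10, 20, 30, 40]]   # 2 ranks, 4 elements each
print(all_reduce(ranks))  # → both ranks hold [11, 22, 33, 44]
```

Note that `all_reduce` is literally `reduce_scatter` followed by `all_gather`, the decomposition that ring implementations and ZeRO-style sharding both exploit.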
Data parallelism: Splitting the data, synchronizing gradients
Data parallelism (DP) is a straightforward approach: the training batch is partitioned across GPUs, and each GPU runs the forward and backward pass on its shard using a full replica of the model. The crucial step is synchronizing the gradients computed on each GPU. This is done with the `all-reduce` collective, which sums the gradients from all GPUs and leaves the identical result on each; dividing by the world size yields the average gradient, so every replica takes the same optimizer step. The method is elegant, typically requiring only a one-line change after the backward pass. Its cost is that every GPU must hold the entire model's parameters, optimizer states, and gradients in memory. It is most effective when the batch size is sufficiently large (ideally a multiple of the world size) and the model fits within a single GPU's memory.
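The correctness of this scheme rests on a simple identity: with equal shard sizes, the average of per-shard mean gradients equals the full-batch mean gradient. A minimal sketch for a one-parameter linear model (names are ours, no real GPUs involved):

```python
def grad(w, batch):
    # d/dw of mean squared error for the 1-D linear model y_hat = w * x
    return sum(2 * x * (w * x - y) for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0)]
w = 0.5

# "Two GPUs", each holding half of the batch:
shard0, shard1 = data[:2], data[2:]
local = [grad(w, shard0), grad(w, shard1)]

# The all-reduce step: sum local gradients, divide by world size.
synced = sum(local) / len(local)

assert abs(synced - grad(w, data)) < 1e-12  # matches full-batch gradient
print(synced)
```

Because every replica sees the same synchronized gradient, all replicas stay bit-identical across optimizer steps.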
Tensor parallelism: Splitting weights within layers
Tensor parallelism (TP) takes a different approach by splitting the model's layers themselves across GPUs. Instead of replicating the entire model, each GPU holds only a portion of the parameters for each layer. This commonly involves splitting the weight matrices column-wise (or row-wise). During the forward pass, each GPU computes its part of the matrix multiplication. To combine these partial results, communication is required. Specifically, an `all-gather` operation is often used on the activations to reconstruct the full activation tensor before passing it to the next layer or operation. In the backward pass, a `reduce-scatter` operation is typically employed to distribute the computed gradients. TP introduces significant communication, making it most effective within a single node where high-bandwidth interconnects like NVLink are available, allowing for rapid exchange of large activation tensors.
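The column-split forward pass can be illustrated with a pure-Python stand-in for the matmul (hypothetical variable names; real TP would shard `torch` tensors and communicate over NVLink):

```python
def matmul(A, B):
    # Naive matrix multiply on nested lists, standing in for a GPU GEMM.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

X = [[1.0, 2.0],
     [3.0, 4.0]]            # activations: batch 2, d_in 2
W = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 2.0]]  # weights: d_in 2, d_out 4

# Column-wise split: rank 0 holds columns 0-1, rank 1 holds columns 2-3.
W0 = [row[:2] for row in W]
W1 = [row[2:] for row in W]
Y0, Y1 = matmul(X, W0), matmul(X, W1)

# The "all-gather": stitch partial outputs back along d_out.
Y = [r0 + r1 for r0, r1 in zip(Y0, Y1)]
assert Y == matmul(X, W)  # identical to the unsharded computation
print(Y)
```

Each rank stored only half of W and did half the FLOPs; the price is communicating the activation tensor at every layer boundary, which is why TP wants NVLink-class bandwidth.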
Pipeline parallelism: Staging layers sequentially
Pipeline parallelism (PP) divides the model by assigning sequential layers to different GPUs. Each GPU processes a specific stage of the network. To mitigate the inefficiency caused by GPUs sitting idle while waiting for data from the previous stage (known as pipeline bubbles), the input batch is further divided into smaller 'micro-batches'. This allows GPUs to process micro-batches concurrently, enabling a pipeline effect where multiple micro-batches are in flight across different stages. While PP can tolerate slower interconnects compared to TP, effectively managing pipeline bubbles and overlapping communication with computation are critical for achieving good performance. This often involves point-to-point send/receive operations between stages and careful scheduling to maximize GPU utilization.
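For the simple GPipe-style schedule, the idle fraction has a closed form: with p stages and m micro-batches, each stage is idle for p − 1 of the m + p − 1 time slots, so bubble = (p − 1) / (m + p − 1). A quick calculation shows why many micro-batches matter:

```python
def bubble_fraction(stages: int, micro_batches: int) -> float:
    # Fraction of each stage's time spent idle in a GPipe-style schedule.
    return (stages - 1) / (micro_batches + stages - 1)

for m in (1, 4, 16, 64):
    print(m, round(bubble_fraction(4, m), 3))
# With 4 stages: m=1 wastes 75% of GPU time; m=64 wastes under 5%.
```

The limit on micro-batch count is memory: more micro-batches in flight means more activations held until their backward pass, which is what more elaborate schedules (e.g. one-forward-one-backward) try to balance.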
Hardware interconnects and communication optimization
The performance of distributed training is heavily influenced by the hardware interconnects between GPUs. The hierarchy of speeds runs from on-chip caches (fastest) to HBM, NVLink (within a node, high bandwidth), InfiniBand (between nodes, supporting RDMA), and Ethernet (slowest, often requiring CPU involvement unless RDMA over Converged Ethernet, RoCE, is used). RDMA (Remote Direct Memory Access) lets a GPU read and write memory on another GPU without CPU intervention, significantly reducing latency. NVLink switches and InfiniBand are standard in high-performance clusters, and systems like NVIDIA's NVL72 put 72 tightly coupled GPUs into a single NVLink domain. The choice of parallelism strategy is often dictated by this hardware: TP benefits from NVLink, while PP can tolerate slower links, even across geographically dispersed nodes.
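A bandwidth-only cost model (latency and overlap ignored; the bandwidth figures below are illustrative, not datasheet values) shows why the interconnect tier matters: a ring all-reduce over n devices moves 2(n−1)/n of the tensor's bytes over each link, so the time is roughly tensor size over link bandwidth, almost independent of n:

```python
def ring_all_reduce_seconds(n_devices: int, tensor_bytes: float,
                            link_bw_bytes_per_s: float) -> float:
    # Bandwidth-optimal ring: each link carries 2*(n-1)/n of the tensor.
    traffic_per_link = 2 * (n_devices - 1) / n_devices * tensor_bytes
    return traffic_per_link / link_bw_bytes_per_s

grads = 14e9  # e.g. ~7B parameters of bf16 gradients (illustrative)
print(ring_all_reduce_seconds(8, grads, 400e9))  # NVLink-class link
print(ring_all_reduce_seconds(8, grads, 50e9))   # InfiniBand-class link
```

An order-of-magnitude gap in link bandwidth translates directly into an order-of-magnitude gap in synchronization time, which is why gradient all-reduces are overlapped with the backward pass whenever possible.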
Bridging parallelism and implementation in PyTorch
PyTorch's `torch.distributed` library provides a high-level interface to these collective operations, abstracting away low-level details handled by libraries like NCCL (NVIDIA Collective Communications Library) for GPUs or Gloo for CPUs. While the lecture focuses on building from scratch to understand mechanisms, practical implementations often leverage these libraries. The lecture demonstrates basic collectives and then applies DDP to an MLP, highlighting the simple insertion of `all_reduce` for gradient averaging. Tensor and pipeline parallelism are also illustrated with an MLP, emphasizing the need for manual handling of communication primitives like `all-gather` and `reduce-scatter` in these schemes. The course emphasizes understanding these fundamental techniques before moving to more complex frameworks or compiler-driven parallelism.
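In the from-scratch spirit of the lecture, the ring algorithm that NCCL-style libraries implement can be sketched in plain Python (lists standing in for GPU buffers, loops standing in for point-to-point sends; no real transport). Phase 1 is a reduce-scatter, phase 2 an all-gather:

```python
def ring_all_reduce(buffers):
    """Simulate a ring all-reduce; buffer length must be divisible by n."""
    n = len(buffers)
    chunk = len(buffers[0]) // n
    bufs = [list(b) for b in buffers]

    def seg(c):  # slice covering chunk c of a buffer
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter (n-1 steps). In step s, rank r passes chunk
    # (r - s) % n to its right neighbour, which adds it into its own copy.
    for s in range(n - 1):
        sends = [(r, (r - s) % n, bufs[r][seg((r - s) % n)])
                 for r in range(n)]                       # snapshot, then apply
        for r, c, payload in sends:
            dst = (r + 1) % n
            bufs[dst][seg(c)] = [a + b
                                 for a, b in zip(bufs[dst][seg(c)], payload)]

    # Phase 2: all-gather (n-1 steps). The fully reduced chunks, which
    # finished on rank r as chunk (r + 1) % n, circulate unchanged.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n, bufs[r][seg((r + 1 - s) % n)])
                 for r in range(n)]
        for r, c, payload in sends:
            bufs[(r + 1) % n][seg(c)] = payload
    return bufs

print(ring_all_reduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
# → every rank holds [12, 15, 18]
```

Each step moves only 1/n of the data per link, which is how the ring achieves the 2(n−1)/n bandwidth term regardless of how many devices participate.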
Common Questions
Why is multi-GPU training necessary for large language models?
The main challenge is fitting the model's parameters, activations, and optimizer states into the memory of a single GPU. Training very large models also necessitates distributing the computation across multiple GPUs to achieve reasonable training times.
Mentioned in this video
● Meta: The parent company of Facebook, mentioned in the context of exploring RDMA over Converged Ethernet for large-scale training.
● NVIDIA: Manufacturer of GPUs and related technologies like NVLink and NCCL, crucial for high-performance computing and AI training architectures.
● Gloo: A backend for PyTorch distributed operations that targets CPUs, allowing for parallel processing even without GPUs.
● Tensor parallelism: A distributed training technique where model layers are split across devices, requiring more communication but enabling larger models.
● Scatter: A collective communication operation where a tensor at one rank is split and distributed to the other ranks.
● Data parallelism: A distributed training technique where the model's parameters are replicated across multiple devices and the data is sharded among them.
● All-reduce: A collective communication operation where data from all ranks is reduced (e.g., summed) and the result is distributed to all ranks.
● Reduce-scatter: A collective communication operation that performs a reduction on tensors across ranks and then scatters the results to different ranks.
● Pipeline parallelism: A distributed training technique where different stages (layers) of the model are placed on different devices and processed sequentially with micro-batches.
● DistributedDataParallel (DDP): A PyTorch implementation of data parallelism where gradients are averaged across all processes after the backward pass.
● All-to-all: A collective communication operation where each rank sends a specific message to every other rank, useful for dynamic routing and expert parallelism.
● ZeRO (Zero Redundancy Optimizer): A memory optimization technique for distributed training that partitions optimizer states and gradients.
● All-gather: A collective communication operation where each rank sends its data to all other ranks, resulting in each rank holding the complete gathered data.
● Broadcast: A collective communication operation where one rank sends its data to all other ranks.
● Gather: A collective communication operation where pieces of data from multiple ranks are concatenated at a specific rank.
● Reduce: A collective communication operation that applies an associative and commutative operation (like sum, max, or min) to data across ranks, with the result on one rank.
● B200: A GPU model with 192 GB of HBM, mentioned as a reference point for model size limitations on single GPUs.
● InfiniBand: A high-performance interconnect technology used for connecting nodes in large-scale distributed computing clusters, supporting RDMA.
● Ethernet: A standard networking technology used for connecting computers, with lower bandwidth than NVLink or InfiniBand for multi-node GPU communication.
● TPU: Tensor Processing Unit, a specialized hardware accelerator developed by Google for machine learning, contrasted with NVIDIA GPUs in architecture and programming.
● PyTorch: A popular open-source machine learning framework used for implementing distributed training strategies and collective operations.
● NVLink: NVIDIA's high-speed interconnect technology for connecting GPUs within a server node, enabling high-bandwidth communication.
● NCCL: NVIDIA's Collective Communications Library, which implements optimized collective operations for multi-GPU and multi-node communication.
● PCIe: Peripheral Component Interconnect Express, a high-speed interface used for connecting components like GPUs to the CPU within a server node.
● A large language model, possibly trained using RDMA over Converged Ethernet, indicating advancements in distributed training infrastructure.
● `torch.distributed`: PyTorch's library providing interfaces for distributed communication primitives and higher-level distributed training modules.