Key Moments

Stanford CS25: Transformers United V6 | The Ultra-Scale Talk: Scaling Training to Thousands of GPUs

Stanford Online
Education · 5 min read · 62 min video
May 11, 2026 · 696 views
TL;DR

Training LLMs on thousands of GPUs requires a complex blend of data, tensor, pipeline, sequence, and expert parallelism, each with trade-offs between memory, communication, and implementation complexity.

Key Insights

1. Large Language Models (LLMs) are scaling to trillions of parameters and are trained on massive datasets (e.g., 15 trillion tokens) with context lengths up to 1 million tokens.

2. Data Parallelism (DP) variants ZeRO-1, ZeRO-2, and ZeRO-3 (also known as FSDP) progressively reduce memory by sharding optimizer states, gradients, and model parameters; ZeRO-3 offers the largest memory savings at the cost of increased communication.

3. Tensor Parallelism (TP) shards model weights and computation across GPUs, enabling larger models but requiring significant communication (ALL_REDUCE) at each block and careful synchronization of parameters outside the TP region.

4. Sequence Parallelism (SP) extends TP by sharding activations along the sequence dimension, reducing activation memory and removing the need to synchronize LayerNorm gradients, though it remains communication-heavy.

5. Pipeline Parallelism (PP) shards layers vertically across GPUs, reducing memory per GPU but introducing a 'pipeline bubble' (idle time) that requires complex schedulers (e.g., 1F1B, DualPipe) or many micro-batches to mitigate.

6. Expert Parallelism (EP) shards the experts of Mixture-of-Experts (MoE) models, requiring ALL_TO_ALL communication to route tokens to their assigned experts; it is typically combined with DP and PP to manage communication overhead.

The imperative for scaling LLMs

The current trend in Large Language Models (LLMs) is relentless scaling, driven by observed correlations between model size and intelligence. Models now routinely reach trillions of parameters, trained on datasets of 15 trillion tokens with context lengths extending to 1 million. This necessitates massive computational infrastructure, placing immense pressure on GPUs for data loading, computation, and checkpointing within tight time constraints (e.g., 1 second per training iteration for a trillion-parameter model).
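
To get a rough sense of scale (an illustrative back-of-the-envelope estimate, not a figure from the talk), the common approximation of 6 * parameters * tokens for training FLOPs shows why thousands of GPUs are required; the peak throughput and utilization numbers below are assumptions.

    # Illustrative compute estimate using the common ~6*N*D FLOPs rule of thumb.
    N = 1e12                      # 1 trillion parameters
    D = 15e12                     # 15 trillion training tokens
    total_flops = 6 * N * D       # ~9e25 FLOPs

    peak_flops_per_gpu = 1e15     # ~1 PFLOP/s BF16 peak, assumed modern accelerator
    mfu = 0.40                    # assumed model FLOPs utilization
    gpu_seconds = total_flops / (peak_flops_per_gpu * mfu)

    print(f"{gpu_seconds / 86400:.2e} GPU-days")                      # ~2.6 million GPU-days
    print(f"{gpu_seconds / 86400 / 10_000:.0f} days on 10,000 GPUs")  # ~260 days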

Mastering memory with ZeRO and data parallelism

Dealing with models that exceed single GPU memory requires distributed strategies. Data Parallelism (DP) is a foundational technique where data is sharded across GPUs, each holding a replica of the model and gradients. To synchronize gradients, the ALL_REDUCE collective operation is used. Efficient overlap of communication with computation, achieved by bucketing gradients, is crucial to minimize GPU idle time. However, standard DP duplicates optimizer states and can't address models larger than GPU memory. The ZeRO (Zero Redundancy Optimizer) family, including ZeRO-1, ZeRO-2, and ZeRO-3 (also known as Fully Sharded Data Parallelism or FSDP), progressively optimizes memory. ZeRO-1 shards optimizer states, ZeRO-2 further shards gradients, and ZeRO-3 shards model parameters as well. ZeRO-3 significantly reduces memory by ensuring each GPU only holds a shard of parameters, optimizer states, and gradients, allowing for training of massive models. The trick involves replacing ALL_REDUCE with REDUCE_SCATTER followed by ALL_GATHER, enabling distributed optimization steps.
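
A minimal PyTorch sketch of how these options map onto DDP and FSDP; the model constructor and launch details are placeholders, not taken from the talk.

    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

    dist.init_process_group("nccl")        # one process per GPU, e.g. launched with torchrun
    model = build_model().cuda()           # placeholder model constructor

    # Plain DP: every GPU keeps a full replica; gradients are ALL_REDUCEd in
    # buckets so communication overlaps with the rest of the backward pass.
    model = DDP(model, bucket_cap_mb=25)

    # ZeRO-2 style (alternative): shard gradients and optimizer states,
    # but keep full parameters on every GPU.
    # model = FSDP(model, sharding_strategy=ShardingStrategy.SHARD_GRAD_OP)

    # ZeRO-3 / FSDP (alternative): also shard parameters; they are ALL_GATHERed
    # just in time per layer and gradients are REDUCE_SCATTERed back into shards.
    # model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)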

Sharding models with tensor and sequence parallelism

When models are too large even for ZeRO strategies, Tensor Parallelism (TP) offers a solution by sharding the model's weights and computations across multiple GPUs. In a transformer block's matrix multiplications, for instance, the first weight matrix can be split column-wise and the second row-wise, with an ALL_REDUCE in the forward pass (and another in the backward pass) to combine the partial results. This reduces the memory footprint and computation per GPU but introduces significant communication overhead at every block. TP assumes the activations entering and leaving the sharded region are identical on all GPUs, so parameters outside the TP region (such as LayerNorms) must have their gradients explicitly synchronized; otherwise numerical drift lets the replicas diverge. Sequence Parallelism (SP) further optimizes TP by sharding activations along the sequence dimension, which is particularly beneficial for long-context training. Like ZeRO, SP relies on REDUCE_SCATTER and ALL_GATHER to shard and re-gather activations, reducing memory usage and eliminating the need for explicit LayerNorm gradient synchronization. Like TP, however, SP remains communication-heavy.
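
A minimal sketch of the Megatron-style column/row split for one MLP block, assuming each rank already holds its weight shards; tensor names and shapes here are illustrative.

    import torch
    import torch.distributed as dist

    def tp_mlp_forward(x, w_up_shard, w_down_shard):
        # Column-parallel up-projection: each rank computes its own slice of the
        # hidden dimension, so no communication is needed for this matmul.
        h = torch.relu(x @ w_up_shard)        # [batch, seq, hidden / tp_size]

        # Row-parallel down-projection: each rank produces a partial sum of the
        # full output from its slice of the hidden dimension.
        y = h @ w_down_shard                  # [batch, seq, d_model], partial sums

        # One ALL_REDUCE per block combines the partial sums on every rank;
        # a matching ALL_REDUCE is needed for the gradients in the backward pass.
        dist.all_reduce(y, op=dist.ReduceOp.SUM)
        return y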

Stitching layers with pipeline parallelism

Pipeline Parallelism (PP) shards the model vertically by distributing layers across GPUs: GPU 1 handles the initial layers, GPU 2 the next set, and so on. This significantly reduces the memory required per GPU, since each holds only a fraction of the layers. The primary challenge with PP is the 'pipeline bubble': periods of GPU idle time while a stage waits for work from its neighbors. To mitigate this, complex scheduling strategies such as '1 forward, 1 backward' (1F1B) or more advanced methods like DualPipe interleave the forward and backward passes of many micro-batches to keep every stage busy. PP requires holding activations for multiple in-flight micro-batches, often necessitating activation checkpointing or CPU offloading to manage memory. Because hiding the bubble demands many micro-batches, PP effectively couples the global batch size to the micro-batch count, but it is relatively communication-cheap, exchanging only activations and gradients between adjacent stages.
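
The usual idealized estimate of the bubble (with p pipeline stages and m micro-batches) shows why large micro-batch counts are needed; this is the standard textbook approximation, not a measurement from the talk.

    # Idealized pipeline bubble: with p stages and m micro-batches, a pass takes
    # m + p - 1 time slots, of which p - 1 are idle on any given stage.
    def bubble_fraction(p: int, m: int) -> float:
        return (p - 1) / (m + p - 1)

    print(bubble_fraction(p=8, m=8))     # ~0.47: nearly half the time is idle
    print(bubble_fraction(p=8, m=64))    # ~0.10: many micro-batches shrink the bubble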

Handling experts in Mixture-of-Experts (MoE) with expert parallelism

For Mixture-of-Experts (MoE) models, Expert Parallelism (EP) is essential: it shards the experts across GPUs. Because each GPU then hosts only a subset of the experts, the data must also be spread across GPUs so that every expert receives tokens, which makes EP less orthogonal to DP than the other strategies. The core communication primitive in EP is ALL_TO_ALL, used to dispatch tokens to the ranks holding their assigned experts and to combine the results afterwards. A significant bottleneck is the CPU-GPU synchronization needed for the router to compute expert assignments and per-rank buffer sizes, especially without hardware support such as InfiniBand with IBGDA; this sync can drastically slow down training. To hide the communication overhead and the CPU-GPU sync, EP is often combined with PP, interleaving the ALL_TO_ALL operations with the forward and backward passes of other micro-batches.
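
A minimal sketch of the dispatch step using PyTorch's all_to_all_single; note that the split sizes must be plain Python integers, which is exactly the CPU-GPU synchronization point described above. The function and variable names are illustrative.

    import torch
    import torch.distributed as dist

    def dispatch_tokens(tokens, send_counts):
        # tokens:      [num_tokens, d_model], already sorted by destination rank
        # send_counts: [world_size] tensor; entry i = tokens routed to experts on rank i

        # First exchange the counts so every rank knows how much it will receive.
        recv_counts = torch.empty_like(send_counts)
        dist.all_to_all_single(recv_counts, send_counts)

        # Split sizes must be Python ints, forcing a GPU -> CPU sync here.
        in_splits = send_counts.tolist()
        out_splits = recv_counts.tolist()

        # ALL_TO_ALL moves each token to the rank hosting its assigned expert;
        # a second ALL_TO_ALL (not shown) sends the expert outputs back.
        recv_buf = tokens.new_empty(sum(out_splits), tokens.size(-1))
        dist.all_to_all_single(recv_buf, tokens,
                               output_split_sizes=out_splits,
                               input_split_sizes=in_splits)
        return recv_buf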

Combining parallelism strategies for ultra-scale training

Achieving ultra-scale training with thousands of GPUs necessitates combining these parallelism strategies. Data, Tensor, Pipeline, Sequence, and Expert Parallelism are designed to be orthogonal, allowing them to be composed into a 'parallelism mesh'. Frameworks like PyTorch (with FSDP, Tensor Parallelism utilities), Megatron-LM, and Nanotron abstract these complexities, enabling users to configure different parallelism dimensions. The choice of which strategies to employ, and to what degree, is highly dependent on hardware constraints (GPU memory, network interconnects like InfiniBand) and model characteristics (size, architecture, sequence length). For instance, TP is often confined to single nodes due to its communication intensity, while PP can span multiple nodes. The ultimate goal is to maximize hardware utilization, minimize idle time, and efficiently manage memory and communication across the distributed system.
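
One way such a mesh can be expressed in PyTorch is sketched below; the dimension sizes and names are arbitrary examples rather than a recommended configuration.

    from torch.distributed.device_mesh import init_device_mesh

    # 512 GPUs arranged as 16-way data parallel x 4-way pipeline x 8-way tensor parallel.
    mesh = init_device_mesh("cuda", (16, 4, 8), mesh_dim_names=("dp", "pp", "tp"))

    # Each dimension provides its own sub-mesh / process group:
    dp_mesh = mesh["dp"]   # gradient ALL_REDUCE or ZeRO sharding
    pp_mesh = mesh["pp"]   # point-to-point sends between pipeline stages
    tp_mesh = mesh["tp"]   # per-block ALL_REDUCE inside tensor-parallel layers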

Common Questions

Why is scaling LLMs so important?

Scaling is crucial because larger models demonstrate improved intelligence and performance. The trend in LLMs is towards larger parameter counts and training on vast amounts of data, which necessitates advanced scaling techniques.

Topics

Mentioned in this video

Software & Apps
torch.distributed.tensor.parallel.parallelize

A utility in PyTorch's tensor-parallel API (torch.distributed.tensor.parallel) used to apply a tensor-parallel sharding plan to a module's layers, allowing TP to be composed with the other parallelism strategies.

IBGDA

InfiniBand GPUDirect Async, an NVIDIA networking feature that lets the GPU initiate InfiniBand transfers directly, avoiding the CPU round-trip that otherwise bottlenecks token dispatch in Expert Parallelism.

RDMA

Remote Direct Memory Access, a networking technology that allows direct memory access between computing nodes, improving communication efficiency.

StarCoder 2

An open code-generation model from the BigCode project, worked on by Nouamane Tazi at Hugging Face.

PyTorch

An open-source machine learning framework used for implementing distributed training strategies like DDP and FSDP.

SmolLM3

A small open language model from Hugging Face; another project Nouamane Tazi worked on.

Muon Optimizer

An optimizer that requires full tensors, posing a challenge for parameter sharding strategies that flatten tensors.

FlashAttention

An optimized attention mechanism that inspired the online softmax approach used in Ring Attention.

Nanotron

An open-source distributed training library developed by HuggingFace.

Kimi K2

An example of a recent large language model with one trillion parameters.

Megatron

Megatron-LM, NVIDIA's library for large-scale model training, referenced as a framework that implements and combines these parallelism strategies.

GPU

Graphics Processing Unit, a type of accelerator used for training large models, limited by VRAM.

FSDP

Fully Sharded Data Parallelism, a PyTorch implementation (also referred to as ZeRO Stage 3) that shards model parameters, gradients, and optimizer states across GPUs.

DDP

Distributed Data Parallelism, a class in PyTorch used for data parallelism that automatically handles gradient synchronization.
