Key Moments

Stanford CS25: Transformers United V6 | The Ultra-Scale Talk: Scaling Training to Thousands of GPUs

Stanford Online
Education · 5 min read · 62 min video
May 11, 2026 · 696 views
TL;DR

Training LLMs on thousands of GPUs requires a complex blend of data, tensor, pipeline, sequence, and expert parallelism, each with trade-offs between memory, communication, and implementation complexity.

Key Insights

1. Large Language Models (LLMs) are scaling to trillions of parameters and are trained on massive datasets (e.g., 15 trillion tokens) with context lengths up to 1 million tokens.

2. Data Parallelism (DP) variants ZeRO-1, ZeRO-2, and ZeRO-3 (also known as FSDP) progressively reduce memory by sharding optimizer states, gradients, and model parameters; ZeRO-3 offers the largest memory savings at the cost of increased communication.

3. Tensor Parallelism (TP) shards model weights and computation across GPUs, enabling larger models but requiring significant communication (ALL_REDUCE) at each block and careful synchronization of parameters outside the TP region.

4. Sequence Parallelism (SP) extends TP by sharding activations along the sequence dimension, reducing activation memory and removing the need to synchronize LayerNorm gradients, though it remains communication-heavy.

5. Pipeline Parallelism (PP) shards layers vertically across GPUs, reducing memory per GPU but introducing a 'pipeline bubble' (idle time) that requires complex schedulers (e.g., 1F1B, DualPipe) or many micro-batches to mitigate.

6. Expert Parallelism (EP) shards the experts of Mixture-of-Experts (MoE) models, requiring ALL_TO_ALL communication to route tokens to their assigned experts; it is typically combined with DP and PP to manage communication overhead.

The imperative for scaling LLMs

The current trend in Large Language Models (LLMs) is relentless scaling, driven by observed correlations between model size and intelligence. Models now routinely reach trillions of parameters, trained on datasets of 15 trillion tokens with context lengths extending to 1 million. This necessitates massive computational infrastructure, placing immense pressure on GPUs for data loading, computation, and checkpointing within tight time constraints (e.g., 1 second per training iteration for a trillion-parameter model).
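
To get a rough sense of scale (an illustrative back-of-the-envelope estimate, not a figure from the talk), the common approximation of 6 * parameters * tokens for training FLOPs shows why thousands of GPUs are required; the peak throughput and utilization numbers below are assumptions.

    # Illustrative compute estimate using the common ~6*N*D FLOPs rule of thumb.
    N = 1e12                      # 1 trillion parameters
    D = 15e12                     # 15 trillion training tokens
    total_flops = 6 * N * D       # ~9e25 FLOPs

    peak_flops_per_gpu = 1e15     # ~1 PFLOP/s BF16 peak, assumed modern accelerator
    mfu = 0.40                    # assumed model FLOPs utilization
    gpu_seconds = total_flops / (peak_flops_per_gpu * mfu)

    print(f"{gpu_seconds / 86400:.2e} GPU-days")                      # ~2.6 million GPU-days
    print(f"{gpu_seconds / 86400 / 10_000:.0f} days on 10,000 GPUs")  # ~260 days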

Mastering memory with ZeRO and data parallelism

Dealing with models that exceed single GPU memory requires distributed strategies. Data Parallelism (DP) is a foundational technique where data is sharded across GPUs, each holding a replica of the model and gradients. To synchronize gradients, the ALL_REDUCE collective operation is used. Efficient overlap of communication with computation, achieved by bucketing gradients, is crucial to minimize GPU idle time. However, standard DP duplicates optimizer states and can't address models larger than GPU memory. The ZeRO (Zero Redundancy Optimizer) family, including ZeRO-1, ZeRO-2, and ZeRO-3 (also known as Fully Sharded Data Parallelism or FSDP), progressively optimizes memory. ZeRO-1 shards optimizer states, ZeRO-2 further shards gradients, and ZeRO-3 shards model parameters as well. ZeRO-3 significantly reduces memory by ensuring each GPU only holds a shard of parameters, optimizer states, and gradients, allowing for training of massive models. The trick involves replacing ALL_REDUCE with REDUCE_SCATTER followed by ALL_GATHER, enabling distributed optimization steps.
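
A minimal PyTorch sketch of how these options map onto DDP and FSDP; the model constructor and launch details are placeholders, not taken from the talk.

    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

    dist.init_process_group("nccl")        # one process per GPU, e.g. launched with torchrun
    model = build_model().cuda()           # placeholder model constructor

    # Plain DP: every GPU keeps a full replica; gradients are ALL_REDUCEd in
    # buckets so communication overlaps with the rest of the backward pass.
    model = DDP(model, bucket_cap_mb=25)

    # ZeRO-2 style (alternative): shard gradients and optimizer states,
    # but keep full parameters on every GPU.
    # model = FSDP(model, sharding_strategy=ShardingStrategy.SHARD_GRAD_OP)

    # ZeRO-3 / FSDP (alternative): also shard parameters; they are ALL_GATHERed
    # just in time per layer and gradients are REDUCE_SCATTERed back into shards.
    # model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)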

Sharding models with tensor and sequence parallelism

When models are too large even for ZeRO strategies, Tensor Parallelism (TP) offers a solution by sharding the model's weights and computations across multiple GPUs. In a transformer block's matrix multiplications, for instance, the first weight matrix can be split column-wise and the second row-wise, with an ALL_REDUCE in the forward pass (and another in the backward pass) to combine the partial results. This reduces the memory footprint and computation per GPU but introduces significant communication overhead at every block. TP assumes the activations entering and leaving the sharded region are identical on all GPUs, so parameters outside the TP region (such as LayerNorms) must have their gradients explicitly synchronized; otherwise numerical drift lets the replicas diverge. Sequence Parallelism (SP) further optimizes TP by sharding activations along the sequence dimension, which is particularly beneficial for long-context training. Like ZeRO, SP relies on REDUCE_SCATTER and ALL_GATHER to shard and re-gather activations, reducing memory usage and eliminating the need for explicit LayerNorm gradient synchronization. Like TP, however, SP remains communication-heavy.
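
A minimal sketch of the Megatron-style column/row split for one MLP block, assuming each rank already holds its weight shards; tensor names and shapes here are illustrative.

    import torch
    import torch.distributed as dist

    def tp_mlp_forward(x, w_up_shard, w_down_shard):
        # Column-parallel up-projection: each rank computes its own slice of the
        # hidden dimension, so no communication is needed for this matmul.
        h = torch.relu(x @ w_up_shard)        # [batch, seq, hidden / tp_size]

        # Row-parallel down-projection: each rank produces a partial sum of the
        # full output from its slice of the hidden dimension.
        y = h @ w_down_shard                  # [batch, seq, d_model], partial sums

        # One ALL_REDUCE per block combines the partial sums on every rank;
        # a matching ALL_REDUCE is needed for the gradients in the backward pass.
        dist.all_reduce(y, op=dist.ReduceOp.SUM)
        return y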

Stitching layers with pipeline parallelism

Pipeline Parallelism (PP) shards the model vertically by distributing layers across GPUs: GPU 1 handles the initial layers, GPU 2 the next set, and so on. This significantly reduces the memory required per GPU, since each holds only a fraction of the layers. The primary challenge with PP is the 'pipeline bubble': periods of GPU idle time while a stage waits for work from its neighbors. To mitigate this, complex scheduling strategies such as '1 forward, 1 backward' (1F1B) or more advanced methods like DualPipe interleave the forward and backward passes of many micro-batches to keep every stage busy. PP requires holding activations for multiple in-flight micro-batches, often necessitating activation checkpointing or CPU offloading to manage memory. Because hiding the bubble demands many micro-batches, PP effectively couples the global batch size to the micro-batch count, but it is relatively communication-cheap, exchanging only activations and gradients between adjacent stages.
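
The usual idealized estimate of the bubble (with p pipeline stages and m micro-batches) shows why large micro-batch counts are needed; this is the standard textbook approximation, not a measurement from the talk.

    # Idealized pipeline bubble: with p stages and m micro-batches, a pass takes
    # m + p - 1 time slots, of which p - 1 are idle on any given stage.
    def bubble_fraction(p: int, m: int) -> float:
        return (p - 1) / (m + p - 1)

    print(bubble_fraction(p=8, m=8))     # ~0.47: nearly half the time is idle
    print(bubble_fraction(p=8, m=64))    # ~0.10: many micro-batches shrink the bubble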

Handling experts in Mixture-of-Experts (MoE) with expert parallelism

For Mixture-of-Experts (MoE) models, Expert Parallelism (EP) is essential: it shards the experts across GPUs. Because each GPU then hosts only a subset of the experts, the data must also be spread across GPUs so that every expert receives tokens, which makes EP less orthogonal to DP than the other strategies. The core communication primitive in EP is ALL_TO_ALL, used to dispatch tokens to the ranks holding their assigned experts and to combine the results afterwards. A significant bottleneck is the CPU-GPU synchronization needed for the router to compute expert assignments and per-rank buffer sizes, especially without hardware support such as InfiniBand with IBGDA; this sync can drastically slow down training. To hide the communication overhead and the CPU-GPU sync, EP is often combined with PP, interleaving the ALL_TO_ALL operations with the forward and backward passes of other micro-batches.
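
A minimal sketch of the dispatch step using PyTorch's all_to_all_single; note that the split sizes must be plain Python integers, which is exactly the CPU-GPU synchronization point described above. The function and variable names are illustrative.

    import torch
    import torch.distributed as dist

    def dispatch_tokens(tokens, send_counts):
        # tokens:      [num_tokens, d_model], already sorted by destination rank
        # send_counts: [world_size] tensor; entry i = tokens routed to experts on rank i

        # First exchange the counts so every rank knows how much it will receive.
        recv_counts = torch.empty_like(send_counts)
        dist.all_to_all_single(recv_counts, send_counts)

        # Split sizes must be Python ints, forcing a GPU -> CPU sync here.
        in_splits = send_counts.tolist()
        out_splits = recv_counts.tolist()

        # ALL_TO_ALL moves each token to the rank hosting its assigned expert;
        # a second ALL_TO_ALL (not shown) sends the expert outputs back.
        recv_buf = tokens.new_empty(sum(out_splits), tokens.size(-1))
        dist.all_to_all_single(recv_buf, tokens,
                               output_split_sizes=out_splits,
                               input_split_sizes=in_splits)
        return recv_buf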

Combining parallelism strategies for ultra-scale training

Achieving ultra-scale training with thousands of GPUs necessitates combining these parallelism strategies. Data, Tensor, Pipeline, Sequence, and Expert Parallelism are designed to be orthogonal, allowing them to be composed into a 'parallelism mesh'. Frameworks like PyTorch (with FSDP, Tensor Parallelism utilities), Megatron-LM, and Nanotron abstract these complexities, enabling users to configure different parallelism dimensions. The choice of which strategies to employ, and to what degree, is highly dependent on hardware constraints (GPU memory, network interconnects like InfiniBand) and model characteristics (size, architecture, sequence length). For instance, TP is often confined to single nodes due to its communication intensity, while PP can span multiple nodes. The ultimate goal is to maximize hardware utilization, minimize idle time, and efficiently manage memory and communication across the distributed system.
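
One way such a mesh can be expressed in PyTorch is sketched below; the dimension sizes and names are arbitrary examples rather than a recommended configuration.

    from torch.distributed.device_mesh import init_device_mesh

    # 512 GPUs arranged as 16-way data parallel x 4-way pipeline x 8-way tensor parallel.
    mesh = init_device_mesh("cuda", (16, 4, 8), mesh_dim_names=("dp", "pp", "tp"))

    # Each dimension provides its own sub-mesh / process group:
    dp_mesh = mesh["dp"]   # gradient ALL_REDUCE or ZeRO sharding
    pp_mesh = mesh["pp"]   # point-to-point sends between pipeline stages
    tp_mesh = mesh["tp"]   # per-block ALL_REDUCE inside tensor-parallel layers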

Common Questions

Why is scaling LLMs so important?

Scaling is crucial because larger models demonstrate improved intelligence and performance. The trend in LLMs is towards larger parameter counts and training on vast amounts of data, which necessitates advanced scaling techniques.

Topics

Mentioned in this video

Software & Apps
torch.distributed.tensor.parallel.parallelize

A utility in PyTorch's tensor-parallel API (torch.distributed.tensor.parallel) used to apply a tensor-parallel sharding plan to a module's layers, allowing TP to be composed with the other parallelism strategies.

IBGDA

InfiniBand GPUDirect Async, an NVIDIA networking feature that lets the GPU initiate InfiniBand transfers directly, avoiding the CPU round-trip that otherwise bottlenecks token dispatch in Expert Parallelism.

RDMA

Remote Direct Memory Access, a networking technology that allows direct memory access between computing nodes, improving communication efficiency.

StarCoder 2

An open code-generation model from the BigCode project, worked on by Nouamane Tazi at Hugging Face.

PyTorch

An open-source machine learning framework used for implementing distributed training strategies like DDP and FSDP.

SmolLM3

A small open language model from Hugging Face; another project Nouamane Tazi worked on.

Muon Optimizer

An optimizer that requires full tensors, posing a challenge for parameter sharding strategies that flatten tensors.

FlashAttention

An optimized attention mechanism that inspired the online softmax approach used in Ring Attention.

Nanotron

An open-source distributed training library developed by HuggingFace.

Kimi K2

An example of a recent large language model with one trillion parameters.

Megatron

Megatron-LM, NVIDIA's library for large-scale model training, referenced as a framework that implements and combines these parallelism strategies.

GPU

Graphics Processing Unit, a type of accelerator used for training large models, limited by VRAM.

FSDP

Fully Sharded Data Parallelism, a PyTorch implementation (also referred to as ZeRO Stage 3) that shards model parameters, gradients, and optimizer states across GPUs.

DDP

Distributed Data Parallelism, a class in PyTorch used for data parallelism that automatically handles gradient synchronization.
