FSDP

Software / App

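Fully Sharded Data Parallelism (FSDP), a memory-saving distributed-training technique that shards model parameters, gradients, and optimizer states across GPUs instead of replicating them on every GPU. It is best known from its PyTorch implementation.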
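
To make the memory saving in the definition above concrete, here is a rough back-of-the-envelope sketch. It assumes mixed-precision training with Adam at about 16 bytes of model state per parameter (2 for half-precision parameters, 2 for gradients, 12 for fp32 optimizer state); those byte counts and the helper name `per_gpu_bytes` are illustrative assumptions, not figures from this page, and the estimate ignores activations and communication buffers.

```python
def per_gpu_bytes(num_params: int, num_gpus: int, sharded: bool) -> float:
    """Approximate per-GPU bytes of model state (hypothetical helper).

    Assumes mixed-precision Adam: these per-parameter byte counts are an
    illustrative assumption, not a measurement.
    """
    bytes_params = 2   # fp16/bf16 parameters
    bytes_grads = 2    # fp16/bf16 gradients
    bytes_optim = 12   # fp32 master weights + Adam momentum + Adam variance
    total = num_params * (bytes_params + bytes_grads + bytes_optim)
    # Plain data parallelism replicates all model state on every GPU;
    # full sharding splits it evenly across the GPUs.
    return total / num_gpus if sharded else float(total)


if __name__ == "__main__":
    n = 7_000_000_000  # example model size, chosen only for illustration
    for gpus in (1, 8, 64):
        replicated = per_gpu_bytes(n, gpus, sharded=False) / 1e9
        sharded = per_gpu_bytes(n, gpus, sharded=True) / 1e9
        print(f"{gpus:>3} GPUs: replicated {replicated:.1f} GB, sharded {sharded:.2f} GB")
```

Under these assumptions, a 7-billion-parameter model carries about 112 GB of model state per GPU when replicated and about 14 GB per GPU when sharded across 8 GPUs. Real usage is higher, since sharded parameters are temporarily gathered during the forward and backward passes.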

Mentioned in 2 videos

Videos Mentioning FSDP

Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 8: Parallelism

Stanford Online

Stanford CS25: Transformers United V6 | The Ultra-Scale Talk: Scaling Training to Thousands of GPUs

Stanford Online

Fully Sharded Data Parallelism, PyTorch's implementation of the approach also known as ZeRO Stage 3, which shards model parameters, gradients, and optimizer states across GPUs.