FSDP
Software / App
Fully Sharded Data Parallelism, a memory-saving technique that shards parameters, gradients, and optimizer states across GPUs, often used in PyTorch.
Mentioned in 2 videos
Videos Mentioning FSDP

Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 8: Parallelism
Stanford Online
Fully Sharded Data Parallelism, a memory-saving technique that shards parameters, gradients, and optimizer states across GPUs, often used in PyTorch.

Stanford CS25: Transformers United V6 I The Ultra-Scale Talk: Scaling Training to Thousands of GPUs
Stanford Online
Fully Sharded Data Parallelism, PyTorch's implementation of the approach also known as ZeRO Stage 3, which shards model parameters, gradients, and optimizer states across GPUs.
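
To make the memory saving described above concrete, here is a rough back-of-the-envelope sketch in plain Python. It is not taken from either video. The function name `per_gpu_bytes` is hypothetical, and the byte counts are assumptions for mixed-precision training with Adam: 2 bytes per parameter for weights, 2 for gradients, and 12 for optimizer state (fp32 master weights, momentum, and variance). Activations, communication buffers, and the parameters that are temporarily gathered for each layer during the forward and backward passes are ignored, so real peak usage will be higher.

```python
def per_gpu_bytes(num_params: int, num_gpus: int, sharded: bool) -> float:
    """Approximate model-state memory per GPU (activations and buffers ignored)."""
    bytes_params = 2      # assumed: fp16/bf16 weights
    bytes_grads = 2       # assumed: fp16/bf16 gradients
    bytes_optimizer = 12  # assumed: Adam with fp32 master weights, momentum, variance
    total = num_params * (bytes_params + bytes_grads + bytes_optimizer)
    # Plain data parallelism replicates all model state on every GPU;
    # full sharding splits parameters, gradients, and optimizer state evenly.
    return total / num_gpus if sharded else float(total)


# Illustrative numbers only: a 7-billion-parameter model on 8 GPUs.
replicated = per_gpu_bytes(7_000_000_000, 8, sharded=False)
fully_sharded = per_gpu_bytes(7_000_000_000, 8, sharded=True)
print(f"replicated: {replicated / 1e9:.0f} GB per GPU")
print(f"fully sharded: {fully_sharded / 1e9:.0f} GB per GPU")
```

Under these assumptions the model state falls from about 112 GB per GPU when fully replicated to about 14 GB per GPU when sharded across 8 GPUs.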