Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 2: PyTorch (einops)
Key Moments
Training large models forces meticulous resource accounting: even small inefficiencies matter at scale, and even a 50% model FLOPs utilization (MFU) is considered good.
Key Insights
Training a 70 billion parameter model on 15 trillion tokens is estimated to take 143 days using 1,024 H100 GPUs.
The largest model trainable on eight 80 GB H100s is roughly 53 billion parameters, assuming about 12 bytes per parameter (2 + 2 + 4 + 4 for weights, activations, gradients, and optimizer states in the lecture's rough accounting).
Matrix multiplications are generally compute-bound: an H100 has an arithmetic intensity of roughly 300 FLOPs per byte, so an algorithm must perform at least that many operations per byte transferred to saturate the hardware.
The forward pass for training a deep network costs 2 * (number of data points) * (number of parameters) FLOPs, while the backward pass costs double that, leading to a total of 6 * (number of data points) * (number of parameters) FLOPs.
Mixed-precision training typically uses BF16 for parameters, activations, and gradients, while using FP32 for optimizer states to maintain stability.
Activation checkpointing trades compute for memory, roughly halving activation memory by recomputing activations during the backward pass; in the extreme case, recomputation cost can grow as L^2 in the number of layers L.
Understanding computational costs: FLOPs and resource limits
The lecture begins by highlighting the practical challenges of training large language models, emphasizing the need to optimize within finite compute and memory. A key takeaway is the importance of resource accounting to maximize computational efficiency. To illustrate, training a 70 billion parameter model on 15 trillion tokens using 1,024 H100 GPUs is estimated to take 143 days, using the formula total FLOPs = 6 × (number of parameters) × (number of tokens). A second example estimates the largest model trainable on eight H100s at around 53 billion parameters, given the available memory and roughly 12 bytes per parameter (2 + 2 + 4 + 4 for weights, activations, gradients, and optimizer states). The core principle is to understand the compute and memory characteristics before optimizing, favoring rough estimates over precise calculations.
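The two estimates above can be reproduced with back-of-envelope arithmetic. This is a sketch, not the lecture's code; the 989 teraFLOP/s dense BF16 peak per H100 and the 50% MFU are assumptions chosen to be consistent with the numbers quoted above:

```python
# Back-of-envelope resource accounting (rough estimates, not precise).
H100_DENSE_BF16 = 989e12   # assumed peak dense BF16 FLOP/s per H100
MFU = 0.5                  # assumed model FLOPs utilization

# Training time: total FLOPs = 6 * parameters * tokens.
total_flops = 6 * 70e9 * 15e12
flops_per_sec = 1024 * H100_DENSE_BF16 * MFU
days = total_flops / flops_per_sec / 86400
print(round(days))  # ~144 days, in line with the quoted ~143-day estimate

# Largest trainable model on eight 80 GB H100s at 12 bytes/parameter.
params = 8 * 80e9 / (2 + 2 + 4 + 4)
print(params / 1e9)  # ~53 billion parameters
```

The point is not precision but orders of magnitude: a 2x error in MFU changes the answer far more than any of the small terms omitted here.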
Tensors, precision, and memory footprints
Tensors are the fundamental building blocks for storing all model components: parameters, gradients, optimizer states, data, and activations. The memory footprint of a tensor is determined by its size and the precision of its elements. Standard float32 (FP32) uses 32 bits (4 bytes) per element, common in scientific computing. However, deep learning often uses lower precisions for efficiency. Float16 (FP16) uses 16 bits (2 bytes) but suffers from a limited dynamic range, leading to instability like underflow and overflow. BF16 (Brain Floating Point) was developed to address this, offering the same dynamic range as FP32 but with reduced precision, making it a common 'sweet spot' for training. For instance, a 4x8 matrix in FP32 would occupy 128 bytes (4x8x4 bytes), whereas the same matrix in BF16 would occupy 64 bytes (4x8x2 bytes). Even lower precisions like FP8 and FP4 are emerging, though they often involve more complex implementations like block scaling to manage dynamic range.
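The byte counts above follow directly from shape times bytes-per-element. A minimal helper sketch (the dtype sizes are hardcoded here for illustration, not taken from any framework API):

```python
# Bytes per element for common deep-learning dtypes.
DTYPE_BYTES = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1}

def tensor_bytes(shape, dtype):
    """Memory footprint of a dense tensor: product of dims times element size."""
    n = 1
    for dim in shape:
        n *= dim
    return n * DTYPE_BYTES[dtype]

print(tensor_bytes((4, 8), "fp32"))  # 128 bytes, as in the example above
print(tensor_bytes((4, 8), "bf16"))  # 64 bytes
```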
Computation efficiency: FLOPs, MFU, and arithmetic intensity
Computation is measured in FLOPs (floating-point operations); a FLOP is a basic arithmetic operation such as an addition or a multiplication. The hardware's theoretical peak, e.g., 1,979 teraFLOP/s for BF16 on an H100 (a figure that assumes structured sparsity; the dense peak is roughly half that), serves as the benchmark. Actual performance, measured as Model FLOPs Utilization (MFU, defined as actual FLOP/s divided by theoretical peak FLOP/s), is typically much lower, with around 0.5 (50%) considered good. The gap is largely explained by memory bandwidth limits. Arithmetic intensity (AI) is the ratio of FLOPs performed to bytes transferred, and an H100 has an AI of about 295 FLOPs/byte. Algorithms with AI lower than the accelerator's are memory-bound (e.g., ReLU, with an AI of 0.25) and spend most of their time waiting on data; algorithms with AI higher than the accelerator's are compute-bound (e.g., large matrix multiplications, with AI around 300 FLOPs/byte or more) and can saturate the compute units.
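The memory-bound versus compute-bound distinction can be checked numerically. A sketch assuming BF16 storage (2 bytes/element), square matrices, and the ~295 FLOPs/byte H100 figure quoted above:

```python
H100_INTENSITY = 295  # approximate H100 FLOPs per byte (from the lecture)

def matmul_intensity(n, bytes_per_elem=2):
    """Arithmetic intensity of an n x n matmul: 2n^3 FLOPs over three n^2 tensors moved."""
    flops = 2 * n**3
    bytes_moved = 3 * n * n * bytes_per_elem  # read A, read B, write C
    return flops / bytes_moved  # simplifies to n / 3 in BF16

for n in (64, 1024, 8192):
    ai = matmul_intensity(n)
    bound = "compute-bound" if ai > H100_INTENSITY else "memory-bound"
    print(f"n={n}: intensity {ai:.0f} FLOPs/byte -> {bound}")

# ReLU: ~1 FLOP per element, 2 bytes read + 2 bytes written in BF16.
relu_intensity = 1 / 4  # 0.25 FLOPs/byte, deeply memory-bound
```

This shows why small matmuls behave like memory-bound ops: intensity grows with matrix size, and only large matrices exceed the hardware's balance point.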
Flops breakdown in neural networks: Forward and backward passes
In training, a significant portion of computation comes from matrix multiplications. For a simple linear layer mapping a D-dimensional input to a K-dimensional output with batch size B, the forward pass requires 2*B*D*K FLOPs (one multiply and one add per weight per example). For deep networks, a common approximation is that the forward pass costs 2 * (number of data points) * (number of parameters) FLOPs. The backward pass typically costs twice as much as the forward pass, because it computes gradients with respect to both the parameters and the activations. The total FLOPs for training (forward plus backward) is therefore approximately 6 * (number of data points) * (number of parameters). This '6ND' formula (where N is the number of tokens or data points and D is the number of parameters, unrelated to the layer dimension D above) is a crucial estimate of training cost.
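The accounting above can be sketched in a few lines (illustrative helper functions, not from the lecture):

```python
def linear_flops(batch, d_in, d_out):
    """Forward FLOPs for a linear layer: one multiply + one add per weight per example."""
    return 2 * batch * d_in * d_out

def training_flops(num_tokens, num_params):
    """Total training FLOPs: forward (2ND) plus backward (4ND) gives 6ND."""
    forward = 2 * num_tokens * num_params
    backward = 2 * forward  # gradients w.r.t. both weights and activations
    return forward + backward

print(linear_flops(8, 512, 256))
print(training_flops(15e12, 70e9))  # the 70B / 15T-token example
```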
Memory usage in training: Parameters, activations, and optimizer states
The memory required for training is substantial. It includes parameters (typically stored in BF16, 2 bytes per parameter), activations (dependent on batch size, sequence length, and model architecture, stored in BF16), and gradients (a copy of parameters, also BF16). A significant portion often comes from optimizer states; for instance, Adagrad requires the sum of squared gradients, using 4 bytes per parameter, while Adam stores first and second-order moments, totaling 8 bytes per parameter. Critically, optimizer states are often kept in FP32 for numerical stability, doubling the memory footprint compared to BF16 parameters. While these states may not be the computational bottleneck, they contribute heavily to the total memory required, limiting the size of models that can be trained on a given hardware.
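These per-parameter byte counts can be tallied with a small calculator. This is a sketch following the breakdown above (BF16 weights and gradients, FP32 optimizer states); it deliberately excludes activations, which scale with batch size and sequence length rather than parameter count:

```python
def training_bytes_per_param(optimizer="adam"):
    """Rough per-parameter training memory, excluding activations."""
    params = 2      # BF16 weights
    grads = 2       # BF16 gradients
    if optimizer == "adam":
        opt_states = 8   # two FP32 moments (first and second)
    elif optimizer == "adagrad":
        opt_states = 4   # one FP32 sum of squared gradients
    else:
        opt_states = 0   # plain SGD keeps no extra per-parameter state
    return params + grads + opt_states

model_params = 70e9
gb = training_bytes_per_param("adam") * model_params / 1e9
print(gb)  # ~840 GB before activations, already more than ten 80 GB GPUs hold
```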
Optimizing memory usage with gradient accumulation and activation checkpointing
To overcome memory limitations and enable larger effective batch sizes, techniques like gradient accumulation and activation checkpointing are employed. Gradient accumulation allows training with larger batch sizes than memory permits by accumulating gradients over smaller 'micro-batches' before performing a single optimizer step and parameter update. Activation checkpointing (or gradient checkpointing) is a more significant memory-saving technique. Instead of storing all intermediate activations from the forward pass in memory for gradient calculation during the backward pass, only a subset of activations is stored at 'checkpoints'. The activations between checkpoints are then recomputed during the backward pass. This trading of compute for memory can halve activation memory usage and offers tunable trade-offs, with extreme cases potentially increasing compute cost by a factor of L^2 (where L is the number of layers) but providing maximum memory efficiency.
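Gradient accumulation's key property, that the accumulated micro-batch gradient equals the full-batch gradient, can be verified on a toy scalar model (an illustrative sketch, not the lecture's code):

```python
# Toy model: prediction = w * x, with squared-error loss averaged over the batch.
def batch_grad(w, xs, ys):
    """Mean gradient of (w*x - y)^2 with respect to w over a batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.0]
w, micro = 0.5, 2  # micro-batch size 2

# Full-batch gradient (what we want, but the full batch may not fit in memory).
full = batch_grad(w, xs, ys)

# Accumulate micro-batch gradients, weighting each by its share of the batch.
acc = 0.0
for i in range(0, len(xs), micro):
    acc += batch_grad(w, xs[i:i + micro], ys[i:i + micro]) * (micro / len(xs))

assert abs(acc - full) < 1e-9  # one optimizer step now sees the same gradient
```

Only the small micro-batch activations need to live in memory at once, which is exactly the trade the paragraph above describes.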
Common Questions
What is the primary goal of resource accounting when training language models?
The primary goal is to maximize computational efficiency by understanding the compute and memory characteristics of the model and hardware. This helps in training the best possible model within a finite set of resources.
Mentioned in this video
FP32: Single-precision floating-point format using 32 bits, discussed as the default tensor type, with its memory implications.
FP16: Half-precision floating-point format using 16 bits, with a reduced memory footprint but poor dynamic range, which can make training unstable.
BF16: BFloat16 format developed in 2018, balancing FP16's memory efficiency with FP32's dynamic range, often a sweet spot for deep learning.
FP8: An 8-bit floating-point format introduced more recently, with variants offering different dynamic-range and resolution trade-offs, supported by NVIDIA's Transformer Engine.
FP4: A four-bit precision format, a very low-precision option in which blocks of values are scaled to extend the representable range.
PyTorch: A deep learning framework used for tensor operations, mixed-precision training, and implementing models and optimizers.
einops: A library for tensor manipulation using named dimensions, inspired by Einstein summation notation, simplifying complex tensor operations like matrix multiplication.
CUDA: NVIDIA's parallel computing platform and API, used here to synchronize GPU operations during benchmarking.
Adagrad: An optimization algorithm that adapts learning rates using the sum of squared past gradients, discussed as a precursor to Adam.
Adam: An optimization algorithm that combines momentum and adaptive learning rates, contrasted with Adagrad and relevant for Assignment 1.
SGD: Stochastic Gradient Descent, a foundational optimization algorithm from which Adagrad and Adam evolved.