
Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 2: PyTorch (einops)

Stanford Online
Apr 14, 2026
TL;DR

Training large language models means working against hard resource limits: even well-run training jobs reach only about 50% model FLOPs utilization (MFU), so careful accounting of compute and memory, down to small inefficiencies, is essential.

Key Insights

1

Training a 70 billion parameter model on 15 trillion tokens is estimated to take 143 days using 1,024 H100 GPUs.

2

Eight H100 GPUs (80 GB each) can hold a model of roughly 53 billion parameters during training, given a per-parameter memory cost of 2 bytes for weights, 2 for activations, 4 for gradients, and 4 for optimizer states.

3

Matrix multiplications are generally compute-bound: the H100 has an arithmetic intensity of approximately 300 FLOPs per byte, so an algorithm must perform at least that many operations per byte transferred to saturate the hardware.

4

The forward pass for training a deep network costs 2 * (number of data points) * (number of parameters) FLOPs, while the backward pass costs double that, leading to a total of 6 * (number of data points) * (number of parameters) FLOPs.

5

Mixed-precision training typically uses BF16 for parameters, activations, and gradients, while using FP32 for optimizer states to maintain stability.

6

Activation checkpointing trades compute for memory, reducing activation memory (by up to half in the basic scheme) by recomputing activations during the backward pass, with up to an L^2 compute overhead in the extreme case (L being the number of layers).

Understanding computational costs: FLOPs and resource limits

The lecture begins by highlighting the practical challenges of training large language models, emphasizing the need to optimize for finite resources like compute and memory. A key takeaway is the importance of resource accounting to maximize computational efficiency. To illustrate, an estimation is provided for training a 70 billion parameter model on 15 trillion tokens using 1,024 H100 GPUs, projecting a timeline of 143 days based on a formula where FLOPs are six times the number of parameters multiplied by the number of tokens. Another example calculates the maximum model size that can be trained on eight H100s, estimating around 53 billion parameters, considering memory constraints and the bytes per parameter (which sum to 2+2+4+4 for weights, activations, gradients, and optimizer states). The core principle is to understand the compute and memory characteristics before optimization, focusing on rough estimations rather than precise calculations.
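The 143-day estimate can be reproduced in a few lines. This is a back-of-envelope sketch using the lecture's 6 × parameters × tokens FLOPs rule; the dense BF16 peak (half the 1,979 teraFLOP/s sparse figure) and the 50% MFU are assumptions carried over from the talk.

```python
# Back-of-envelope training-time estimate for a 70B model on 15T tokens.
num_params = 70e9           # 70 billion parameters
num_tokens = 15e12          # 15 trillion tokens
total_flops = 6 * num_params * num_tokens   # lecture's 6*N*D rule

h100_peak = 1979e12 / 2     # dense BF16 peak per H100, FLOP/s (sparse figure halved)
mfu = 0.5                   # assumed model FLOPs utilization
num_gpus = 1024

seconds = total_flops / (num_gpus * h100_peak * mfu)
days = seconds / 86400
print(f"{days:.0f} days")   # ~143-144 days, matching the lecture's estimate
```

Small changes to the assumed MFU move the answer by weeks, which is why the lecture stresses rough estimation over precision.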

Tensors, precision, and memory footprints

Tensors are the fundamental building blocks for storing all model components: parameters, gradients, optimizer states, data, and activations. The memory footprint of a tensor is determined by its size and the precision of its elements. Standard float32 (FP32) uses 32 bits (4 bytes) per element, common in scientific computing. However, deep learning often uses lower precisions for efficiency. Float16 (FP16) uses 16 bits (2 bytes) but suffers from a limited dynamic range, leading to instability like underflow and overflow. BF16 (Brain Floating Point) was developed to address this, offering the same dynamic range as FP32 but with reduced precision, making it a common 'sweet spot' for training. For instance, a 4x8 matrix in FP32 would occupy 128 bytes (4x8x4 bytes), whereas the same matrix in BF16 would occupy 64 bytes (4x8x2 bytes). Even lower precisions like FP8 and FP4 are emerging, though they often involve more complex implementations like block scaling to manage dynamic range.
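The 4x8 example above can be checked directly in PyTorch, where a tensor's footprint is its element count times the bytes per element:

```python
import torch

# Memory footprint of a tensor = numel * bytes per element.
x = torch.zeros(4, 8, dtype=torch.float32)   # FP32: 4 bytes/element
print(x.numel() * x.element_size())          # 128 bytes

y = x.to(torch.bfloat16)                     # BF16: 2 bytes/element
print(y.numel() * y.element_size())          # 64 bytes
```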

Computation efficiency: FLOPs, MFU, and arithmetic intensity

Computation is measured in FLOPs (floating-point operations); a FLOP is a basic arithmetic operation such as an addition or a multiplication. The theoretical peak of the hardware, e.g. 1,979 teraFLOP/s for BF16 on an H100 (a figure that assumes structured sparsity; the dense peak is roughly half), serves as a benchmark. Actual performance, measured as Model FLOPs Utilization (MFU) and defined as achieved FLOP/s divided by theoretical peak FLOP/s, is typically much lower, with around 0.5 (50%) considered good. The discrepancy is largely explained by memory bandwidth limits. Arithmetic intensity (AI) is the ratio of FLOPs performed to bytes transferred. Accelerators like the H100 have an AI of about 295 FLOPs/byte. Algorithms with a lower AI than the accelerator's are memory-bound (e.g., elementwise ReLU, with an AI of 0.25) and spend most of their time waiting for data; algorithms with a higher AI are compute-bound (e.g., matrix multiplication, whose AI reaches roughly 300 FLOPs/byte for large matrices) and can saturate the compute units.
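These ratios are easy to work out numerically. A minimal sketch, assuming the H100's dense BF16 peak (~989 teraFLOP/s) and ~3.35 TB/s HBM bandwidth, and counting BF16 bytes for ReLU and square matmul:

```python
# Hardware arithmetic intensity: peak FLOP/s divided by memory bandwidth.
peak_flops = 1979e12 / 2       # dense BF16 peak, FLOP/s (assumed)
mem_bandwidth = 3.35e12        # HBM bandwidth, bytes/s (assumed)
hardware_ai = peak_flops / mem_bandwidth   # ~295 FLOPs/byte

# Elementwise ReLU in BF16: 1 FLOP per element, 2 bytes read + 2 bytes written.
relu_ai = 1 / 4                # 0.25 -> deeply memory-bound

# Square (N x N) matmul in BF16: 2*N^3 FLOPs, three N*N matrices moved.
def matmul_ai(n: int) -> float:
    return (2 * n**3) / (3 * n * n * 2)   # grows like n/3

print(round(hardware_ai), relu_ai, matmul_ai(1024))
```

For N = 1024 the matmul's AI (~341) already exceeds the hardware's ~295, which is why large matmuls are compute-bound while ReLU never is.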

FLOPs breakdown in neural networks: Forward and backward passes

In training, a significant portion of computation comes from matrix multiplications. For a simple linear layer mapping a D-dimensional input to a K-dimensional output with a batch size B, the forward pass requires 2*B*D*K FLOPs. For deep networks, a common approximation is that the forward pass costs 2 * (number of data points) * (number of parameters) FLOPs. The backward pass, which computes gradients, is computationally more expensive, typically costing twice as much as the forward pass. This is because it involves computing gradients with respect to both parameters and activations. Consequently, the total FLOPs for training a deep network (including forward and backward passes) is approximately 6 * (number of data points) * (number of parameters). This '6*N*D' formula (where N is tokens/data points and D is parameters) is a crucial estimate for understanding training costs.
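The two counting rules above can be written as small helper functions (names are illustrative, not from the lecture):

```python
# FLOPs accounting for training, per the lecture's approximations.
def linear_forward_flops(B: int, D: int, K: int) -> int:
    # A B x D input times a D x K weight: 2 FLOPs (multiply + add) per output entry.
    return 2 * B * D * K

def training_flops(num_tokens: float, num_params: float) -> float:
    # Forward ~ 2*N*D, backward ~ twice that, total ~ 6*N*D
    # (N = tokens/data points, D = parameters, following the section above).
    return 6 * num_tokens * num_params

print(linear_forward_flops(4, 16, 32))   # 4096
print(training_flops(15e12, 70e9))       # 6.3e+24
```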

Memory usage in training: Parameters, activations, and optimizer states

The memory required for training is substantial. It includes parameters (typically stored in BF16, 2 bytes per parameter), activations (dependent on batch size, sequence length, and model architecture, also BF16), and gradients (one value per parameter, also BF16). A large share often comes from optimizer states: Adagrad keeps a sum of squared gradients (4 bytes per parameter), while Adam keeps first and second moments (8 bytes per parameter). Critically, optimizer states are usually kept in FP32 for numerical stability, doubling their footprint relative to BF16. These states are rarely the computational bottleneck, but they contribute heavily to total memory and limit the model size trainable on given hardware.
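Putting the per-parameter byte counts together recovers the ~53B figure from the earlier estimate. This sketch uses the lecture's simplified accounting (treating activations as a flat 2 bytes per parameter) and assumes 80 GB per H100:

```python
# Rough per-parameter training memory (lecture's simplified byte counts):
# weights 2 + activations 2 + gradients 4 + optimizer states 4 = 12 bytes.
bytes_per_param = 2 + 2 + 4 + 4

gpu_memory = 80e9              # one H100, bytes (assumed 80 GB)
num_gpus = 8
max_params = num_gpus * gpu_memory / bytes_per_param
print(f"~{max_params / 1e9:.0f}B parameters")   # ~53B
```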

Optimizing memory usage with gradient accumulation and activation checkpointing

To overcome memory limitations and enable larger effective batch sizes, techniques like gradient accumulation and activation checkpointing are employed. Gradient accumulation simulates a batch size larger than memory permits by accumulating gradients over smaller 'micro-batches' before performing a single optimizer step and parameter update. Activation checkpointing (or gradient checkpointing) is a more significant memory-saving technique. Instead of storing every intermediate activation from the forward pass for use in the backward pass, only a subset is stored at 'checkpoints', and the activations between checkpoints are recomputed during the backward pass. Trading compute for memory this way can halve activation memory usage and offers tunable trade-offs: in the extreme case, compute cost grows by a factor of L^2 (where L is the number of layers) in exchange for maximum memory savings.
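Gradient accumulation can be sketched in a few lines of PyTorch. This is a minimal toy example (tiny linear model, MSE loss, all names illustrative) verifying that accumulated micro-batch gradients match a single full-batch backward pass:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
data, target = torch.randn(8, 4), torch.randn(8, 1)
loss_fn = torch.nn.MSELoss()

# Reference: one full-batch backward pass.
loss_fn(model(data), target).backward()
full_grad = model.weight.grad.clone()
model.zero_grad()

# Gradient accumulation: four micro-batches of 2, one optimizer step at the end.
micro = 2
for i in range(0, len(data), micro):
    loss = loss_fn(model(data[i:i + micro]), target[i:i + micro])
    # Scale so the accumulated gradient equals the full-batch mean-loss gradient.
    (loss * micro / len(data)).backward()

print(torch.allclose(model.weight.grad, full_grad, atol=1e-6))  # True
```

For activation checkpointing, PyTorch provides `torch.utils.checkpoint.checkpoint`, which wraps a module or function so its intermediate activations are recomputed during the backward pass rather than stored.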

Common Questions

What is the primary goal of resource accounting when training language models? To maximize computational efficiency by understanding the compute and memory characteristics of the model and hardware, so as to train the best possible model within a finite set of resources.
