Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 2: PyTorch (einops)
Key Moments
Training large models forces meticulous resource accounting: even small inefficiencies matter at scale, and even a 50% model FLOPs utilization (MFU) is considered good.
Key Insights
Training a 70 billion parameter model on 15 trillion tokens is estimated to take 143 days using 1,024 H100 GPUs.
The largest model trainable on eight 80 GB H100s is roughly 53 billion parameters, assuming about 12 bytes per parameter (2 + 2 + 4 + 4 for weights, activations, gradients, and optimizer states in the lecture's rough accounting).
Matrix multiplications are generally compute-bound: an H100 has an arithmetic intensity of roughly 300 FLOPs per byte, so an algorithm must perform at least that many operations per byte transferred to saturate the hardware.
The forward pass for training a deep network costs 2 * (number of data points) * (number of parameters) FLOPs, while the backward pass costs double that, leading to a total of 6 * (number of data points) * (number of parameters) FLOPs.
Mixed-precision training typically uses BF16 for parameters, activations, and gradients, while using FP32 for optimizer states to maintain stability.
Activation checkpointing trades compute for memory, roughly halving activation memory by recomputing activations during the backward pass; in the extreme case, recomputation cost can grow as L^2 in the number of layers L.
Understanding computational costs: FLOPs and resource limits
The lecture begins by highlighting the practical challenges of training large language models, emphasizing the need to optimize within finite compute and memory. A key takeaway is the importance of resource accounting to maximize computational efficiency. To illustrate, training a 70 billion parameter model on 15 trillion tokens using 1,024 H100 GPUs is estimated to take 143 days, using the formula total FLOPs = 6 × (number of parameters) × (number of tokens). A second example estimates the largest model trainable on eight H100s at around 53 billion parameters, given the available memory and roughly 12 bytes per parameter (2 + 2 + 4 + 4 for weights, activations, gradients, and optimizer states). The core principle is to understand the compute and memory characteristics before optimizing, favoring rough estimates over precise calculations.
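The two estimates above can be reproduced with back-of-envelope arithmetic. This is a sketch, not the lecture's code; the 989 teraFLOP/s dense BF16 peak per H100 and the 50% MFU are assumptions chosen to be consistent with the numbers quoted above:

```python
# Back-of-envelope resource accounting (rough estimates, not precise).
H100_DENSE_BF16 = 989e12   # assumed peak dense BF16 FLOP/s per H100
MFU = 0.5                  # assumed model FLOPs utilization

# Training time: total FLOPs = 6 * parameters * tokens.
total_flops = 6 * 70e9 * 15e12
flops_per_sec = 1024 * H100_DENSE_BF16 * MFU
days = total_flops / flops_per_sec / 86400
print(round(days))  # ~144 days, in line with the quoted ~143-day estimate

# Largest trainable model on eight 80 GB H100s at 12 bytes/parameter.
params = 8 * 80e9 / (2 + 2 + 4 + 4)
print(params / 1e9)  # ~53 billion parameters
```

The point is not precision but orders of magnitude: a 2x error in MFU changes the answer far more than any of the small terms omitted here.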
Tensors, precision, and memory footprints
Tensors are the fundamental building blocks for storing all model components: parameters, gradients, optimizer states, data, and activations. The memory footprint of a tensor is determined by its size and the precision of its elements. Standard float32 (FP32) uses 32 bits (4 bytes) per element, common in scientific computing. However, deep learning often uses lower precisions for efficiency. Float16 (FP16) uses 16 bits (2 bytes) but suffers from a limited dynamic range, leading to instability like underflow and overflow. BF16 (Brain Floating Point) was developed to address this, offering the same dynamic range as FP32 but with reduced precision, making it a common 'sweet spot' for training. For instance, a 4x8 matrix in FP32 would occupy 128 bytes (4x8x4 bytes), whereas the same matrix in BF16 would occupy 64 bytes (4x8x2 bytes). Even lower precisions like FP8 and FP4 are emerging, though they often involve more complex implementations like block scaling to manage dynamic range.
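The byte counts above follow directly from shape times bytes-per-element. A minimal helper sketch (the dtype sizes are hardcoded here for illustration, not taken from any framework API):

```python
# Bytes per element for common deep-learning dtypes.
DTYPE_BYTES = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1}

def tensor_bytes(shape, dtype):
    """Memory footprint of a dense tensor: product of dims times element size."""
    n = 1
    for dim in shape:
        n *= dim
    return n * DTYPE_BYTES[dtype]

print(tensor_bytes((4, 8), "fp32"))  # 128 bytes, as in the example above
print(tensor_bytes((4, 8), "bf16"))  # 64 bytes
```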
Computation efficiency: FLOPs, MFU, and arithmetic intensity
Computation is measured in FLOPs (floating-point operations); a FLOP is a basic arithmetic operation such as an addition or a multiplication. The hardware's theoretical peak, e.g., 1,979 teraFLOP/s for BF16 on an H100 (a figure that assumes structured sparsity; the dense peak is roughly half that), serves as the benchmark. Actual performance, measured as Model FLOPs Utilization (MFU, defined as actual FLOP/s divided by theoretical peak FLOP/s), is typically much lower, with around 0.5 (50%) considered good. The gap is largely explained by memory bandwidth limits. Arithmetic intensity (AI) is the ratio of FLOPs performed to bytes transferred, and an H100 has an AI of about 295 FLOPs/byte. Algorithms with AI lower than the accelerator's are memory-bound (e.g., ReLU, with an AI of 0.25) and spend most of their time waiting on data; algorithms with AI higher than the accelerator's are compute-bound (e.g., large matrix multiplications, with AI around 300 FLOPs/byte or more) and can saturate the compute units.
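The memory-bound versus compute-bound distinction can be checked numerically. A sketch assuming BF16 storage (2 bytes/element), square matrices, and the ~295 FLOPs/byte H100 figure quoted above:

```python
H100_INTENSITY = 295  # approximate H100 FLOPs per byte (from the lecture)

def matmul_intensity(n, bytes_per_elem=2):
    """Arithmetic intensity of an n x n matmul: 2n^3 FLOPs over three n^2 tensors moved."""
    flops = 2 * n**3
    bytes_moved = 3 * n * n * bytes_per_elem  # read A, read B, write C
    return flops / bytes_moved  # simplifies to n / 3 in BF16

for n in (64, 1024, 8192):
    ai = matmul_intensity(n)
    bound = "compute-bound" if ai > H100_INTENSITY else "memory-bound"
    print(f"n={n}: intensity {ai:.0f} FLOPs/byte -> {bound}")

# ReLU: ~1 FLOP per element, 2 bytes read + 2 bytes written in BF16.
relu_intensity = 1 / 4  # 0.25 FLOPs/byte, deeply memory-bound
```

This shows why small matmuls behave like memory-bound ops: intensity grows with matrix size, and only large matrices exceed the hardware's balance point.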
Flops breakdown in neural networks: Forward and backward passes
In training, a significant portion of computation comes from matrix multiplications. For a simple linear layer mapping a D-dimensional input to a K-dimensional output with batch size B, the forward pass requires 2*B*D*K FLOPs (one multiply and one add per weight per example). For deep networks, a common approximation is that the forward pass costs 2 * (number of data points) * (number of parameters) FLOPs. The backward pass typically costs twice as much as the forward pass, because it computes gradients with respect to both the parameters and the activations. The total FLOPs for training (forward plus backward) is therefore approximately 6 * (number of data points) * (number of parameters). This '6ND' formula (where N is the number of tokens or data points and D is the number of parameters, unrelated to the layer dimension D above) is a crucial estimate of training cost.
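The accounting above can be sketched in a few lines (illustrative helper functions, not from the lecture):

```python
def linear_flops(batch, d_in, d_out):
    """Forward FLOPs for a linear layer: one multiply + one add per weight per example."""
    return 2 * batch * d_in * d_out

def training_flops(num_tokens, num_params):
    """Total training FLOPs: forward (2ND) plus backward (4ND) gives 6ND."""
    forward = 2 * num_tokens * num_params
    backward = 2 * forward  # gradients w.r.t. both weights and activations
    return forward + backward

print(linear_flops(8, 512, 256))
print(training_flops(15e12, 70e9))  # the 70B / 15T-token example
```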
Memory usage in training: Parameters, activations, and optimizer states
The memory required for training is substantial. It includes parameters (typically stored in BF16, 2 bytes per parameter), activations (dependent on batch size, sequence length, and model architecture, stored in BF16), and gradients (a copy of parameters, also BF16). A significant portion often comes from optimizer states; for instance, Adagrad requires the sum of squared gradients, using 4 bytes per parameter, while Adam stores first and second-order moments, totaling 8 bytes per parameter. Critically, optimizer states are often kept in FP32 for numerical stability, doubling the memory footprint compared to BF16 parameters. While these states may not be the computational bottleneck, they contribute heavily to the total memory required, limiting the size of models that can be trained on a given hardware.
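These per-parameter byte counts can be tallied with a small calculator. This is a sketch following the breakdown above (BF16 weights and gradients, FP32 optimizer states); it deliberately excludes activations, which scale with batch size and sequence length rather than parameter count:

```python
def training_bytes_per_param(optimizer="adam"):
    """Rough per-parameter training memory, excluding activations."""
    params = 2      # BF16 weights
    grads = 2       # BF16 gradients
    if optimizer == "adam":
        opt_states = 8   # two FP32 moments (first and second)
    elif optimizer == "adagrad":
        opt_states = 4   # one FP32 sum of squared gradients
    else:
        opt_states = 0   # plain SGD keeps no extra per-parameter state
    return params + grads + opt_states

model_params = 70e9
gb = training_bytes_per_param("adam") * model_params / 1e9
print(gb)  # ~840 GB before activations, already more than ten 80 GB GPUs hold
```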
Optimizing memory usage with gradient accumulation and activation checkpointing
To overcome memory limitations and enable larger effective batch sizes, techniques like gradient accumulation and activation checkpointing are employed. Gradient accumulation allows training with larger batch sizes than memory permits by accumulating gradients over smaller 'micro-batches' before performing a single optimizer step and parameter update. Activation checkpointing (or gradient checkpointing) is a more significant memory-saving technique. Instead of storing all intermediate activations from the forward pass in memory for gradient calculation during the backward pass, only a subset of activations is stored at 'checkpoints'. The activations between checkpoints are then recomputed during the backward pass. This trading of compute for memory can halve activation memory usage and offers tunable trade-offs, with extreme cases potentially increasing compute cost by a factor of L^2 (where L is the number of layers) but providing maximum memory efficiency.
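Gradient accumulation's key property, that the accumulated micro-batch gradient equals the full-batch gradient, can be verified on a toy scalar model (an illustrative sketch, not the lecture's code):

```python
# Toy model: prediction = w * x, with squared-error loss averaged over the batch.
def batch_grad(w, xs, ys):
    """Mean gradient of (w*x - y)^2 with respect to w over a batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.0]
w, micro = 0.5, 2  # micro-batch size 2

# Full-batch gradient (what we want, but the full batch may not fit in memory).
full = batch_grad(w, xs, ys)

# Accumulate micro-batch gradients, weighting each by its share of the batch.
acc = 0.0
for i in range(0, len(xs), micro):
    acc += batch_grad(w, xs[i:i + micro], ys[i:i + micro]) * (micro / len(xs))

assert abs(acc - full) < 1e-9  # one optimizer step now sees the same gradient
```

Only the small micro-batch activations need to live in memory at once, which is exactly the trade the paragraph above describes.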
Common Questions
What is the primary goal of resource accounting when training language models?
The primary goal is to maximize computational efficiency by understanding the compute and memory characteristics of the model and hardware. This helps in training the best possible model within a finite set of resources.
Mentioned in this video
FP32: Single-precision floating-point format using 32 bits, discussed as the default tensor type, with its memory implications.
FP16: Half-precision floating-point format using 16 bits, with a reduced memory footprint but poor dynamic range, which can make training unstable.
BF16: BFloat16 format developed in 2018, balancing FP16's memory efficiency with FP32's dynamic range, often a sweet spot for deep learning.
FP8: An 8-bit floating-point format introduced more recently, with variants offering different dynamic-range and resolution trade-offs, supported by NVIDIA's Transformer Engine.
FP4: A four-bit precision format, a very low-precision option in which blocks of values are scaled to extend the representable range.
PyTorch: A deep learning framework used for tensor operations, mixed-precision training, and implementing models and optimizers.
einops: A library for tensor manipulation using named dimensions, inspired by Einstein summation notation, simplifying complex tensor operations like matrix multiplication.
CUDA: NVIDIA's parallel computing platform and API, used here to synchronize GPU operations during benchmarking.
Adagrad: An optimization algorithm that adapts learning rates using the sum of squared past gradients, discussed as a precursor to Adam.
Adam: An optimization algorithm that combines momentum and adaptive learning rates, contrasted with Adagrad and relevant for Assignment 1.
SGD: Stochastic Gradient Descent, a foundational optimization algorithm from which Adagrad and Adam evolved.