Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 11: Scaling Laws
Key Moments
Scaling neural networks requires careful hyperparameter tuning. Techniques such as muP (maximal update parametrization) and careful optimizer choices are crucial for stable performance across model sizes.
Key Insights
The muP (maximal update parametrization) approach aims to keep the optimal learning rate stable across model scales by adjusting initializations and learning rates per parameter.
Warm-up Stable Decay (WSD) learning rate schedules are a common and versatile choice: because the learning rate is held constant for most of training, a run can be continued from a stable-phase checkpoint without restarting.
DeepSeek's approach to hyperparameter scaling involves directly fitting scaling laws to estimate optimal batch sizes and learning rates through extensive grid searches across various scales.
Muon, an optimizer that treats matrices differently from vectors by operating on their spectral properties, shows significant gains at small scales but its effectiveness at large scales is still being actively researched.
The muP program posits two invariants for scaling: activations at initialization stay O(1) as width grows, and the change in activations after one gradient step is also O(1). Together these imply specific parameterization rules.
Scaling laws, while powerful for guiding decisions on architectures, optimizers, and hyperparameters, are ultimately an art with no guaranteed silver bullet for extrapolation to all scales.
Stabilizing learning rates with muP initialization
The lecture introduces muP (maximal update parametrization) as a technique for keeping the optimal learning rate consistent as model scale changes. This is achieved by scaling the embedding outputs and residual connections, scaling matrix initializations with layer width (fan-in and fan-out), and using per-parameter learning rates. The core idea is to alter initializations and learning rates so that training behaves similarly at every width, making hyperparameter tuning less scale-dependent. The MiniCPM paper, discussed as an early example, used muP to obtain a stable optimal learning rate across a range of model sizes. This removes much of the need for learning rate tuning at each scale, since the learning rate can be fixed to a single optimum. The optimal batch size, however, still varies with data and model size and requires separate consideration.
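The kind of width-dependent rescaling described above can be sketched as follows. This is a minimal illustration, assuming a small proxy model of width `base_width` whose learning rate and initialization were tuned directly. The function name, the dictionary keys, and the specific rules (matrix learning rate divided by the width ratio, matrix init std divided by its square root, output logits divided by the width ratio) are assumptions in the spirit of muP-style recipes for Adam-like optimizers, not the exact settings of MiniCPM or any other paper.

```python
import math


def mup_scaled_hparams(width, base_width, base_lr, base_init_std):
    """Rescale hyperparameters tuned at base_width for a model of a new width.

    All rules below are illustrative assumptions, not a paper's exact recipe.
    """
    ratio = width / base_width  # width multiplier relative to the tuned proxy
    return {
        # Matrix-shaped (hidden) weights: learning rate shrinks with width.
        "matrix_lr": base_lr / ratio,
        # Matrix init std shrinks with the square root of the width ratio.
        "matrix_init_std": base_init_std / math.sqrt(ratio),
        # Vector-shaped parameters (biases, gains) keep the base learning rate.
        "vector_lr": base_lr,
        # Output logits are scaled down so they stay O(1) as width grows.
        "logit_multiplier": 1.0 / ratio,
    }
```

For example, `mup_scaled_hparams(1024, 256, 0.01, 0.1)` describes a model four times wider than the proxy: the matrix learning rate drops by 4x and the matrix init std by 2x, while the base learning rate itself is never re-tuned.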
Warm-up Stable Decay (WSD) for efficient learning rate scheduling
To address the challenge of setting learning rate schedules when model and data scales vary, the lecture presents the Warm-up Stable Decay (WSD) learning rate schedule. It has three phases: a warm-up, a long stable phase in which the learning rate is constant, and a rapid decay to zero. A key advantage of WSD is that training can be resumed from any checkpoint in the stable phase. To evaluate a different data budget, one branches off a stable-phase checkpoint and runs only the short decay, which significantly reduces the cost of data scaling experiments. Cosine schedules, by contrast, require the total training budget to be known in advance. WSD may slightly underperform cosine in some cases, but it is comparable in many others and serves as a versatile default, making Chinchilla-style analysis and data scaling experiments much more manageable.
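The three phases can be written as a small function of the step count. This is a sketch under stated assumptions: the function name and parameters are hypothetical, and the linear shapes of the warm-up and decay are a choice made here for simplicity (the summary only says the decay is rapid and ends at zero).

```python
def wsd_lr(step, peak_lr, warmup_steps, stable_steps, decay_steps):
    """Learning rate at `step` for a warm-up / stable / decay schedule.

    Assumes decay_steps > 0. Linear warm-up and linear decay are assumptions.
    """
    if step < warmup_steps:
        # Warm-up: ramp linearly from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    decay_start = warmup_steps + stable_steps
    if step < decay_start:
        # Stable phase: constant, independent of the total training length.
        return peak_lr
    # Decay: ramp linearly from the peak down to zero, then stay at zero.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr * (1.0 - progress)
```

Because the stable-phase value does not depend on `stable_steps`, a checkpoint taken at step 500 is equally valid for a run with 800 or 1800 stable steps. Only the decay segment has to be rerun for each data budget, which is the property a cosine schedule lacks.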
DeepSeek's approach to fitting scaling laws for hyperparameters
DeepSeek offers an alternative to stabilizing hyperparameters: directly fitting scaling laws that estimate the optimal batch size and learning rate. This involves running extensive grid searches at several scales to locate the best hyperparameters at each, then fitting lines (in log-log space) to these optima as a function of compute. The optimal batch size was found to scale with non-embedding training FLOPs, and the optimal learning rate showed a trend in compute as well. The learning rate fit looks noisier than the batch size fit, but the DeepSeek approach demonstrated its feasibility by producing reasonable results. Where muP aims for invariance, this method instead accepts that hyperparameters change with scale and captures that change in a predictive model.
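Fitting a line to the optima in log-log space amounts to fitting a power law. A minimal sketch of that step follows, using ordinary least squares and the standard library only. The data points are synthetic and generated from a made-up power law purely to exercise the code. They are not measurements from DeepSeek or any other paper, and `fit_power_law` is a hypothetical helper name.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b as a straight line in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    b = sxy / sxx                         # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)     # intercept gives the coefficient
    return a, b


# SYNTHETIC optima (made up for illustration): the best batch size found by a
# hypothetical grid search at each compute budget, in FLOPs.
compute = [1e17, 1e18, 1e19, 1e20]
best_batch = [0.5 * c ** 0.3 for c in compute]

coef, exponent = fit_power_law(compute, best_batch)
# Extrapolate the fitted trend to a larger, not-yet-run compute budget.
predicted_batch = coef * (1e22) ** exponent
```

In practice the optima from a real grid search are noisy, so the fit is only as trustworthy as the range of scales it covers. The same helper could be applied to the learning rate optima, where the fitted exponent would be expected to be negative.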
Scaling hyperparameter dependencies from StepFun research
Further investigation into hyperparameter scaling, particularly from the StepFun research, highlights the complex relationships between learning rate, batch size, and model/data scale. A key, perhaps counter-intuitive, finding is that optimal batch size appears to depend primarily on the total amount of training data, following a power-law relationship. Conversely, the optimal learning rate shows a more complex dependence, decreasing with larger models but increasing with more data, though this latter trend might be fragile. The research also suggests that while learning rates are generally robust hyperparameters, optimal batch sizes can shift more significantly with changes in training data, indicating potential contingencies in these scaling models.
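The qualitative trends described above can be made concrete with power-law forms in model size and data size. In this sketch the function names, the functional forms, and every coefficient and exponent are placeholders chosen only to reproduce the stated directions (batch size grows with data; learning rate falls with model size and rises with data). None of the numbers come from the StepFun work.

```python
def optimal_batch_tokens(data_tokens, coef=1.0, exponent=0.5):
    """Hypothetical power law: optimal batch size depends only on data size."""
    return coef * data_tokens ** exponent


def optimal_lr(model_params, data_tokens, coef=1.0, n_exp=-0.7, d_exp=0.3):
    """Hypothetical joint power law in model size N and data size D.

    n_exp < 0: larger models prefer smaller learning rates.
    d_exp > 0: more data permits a larger learning rate.
    """
    return coef * model_params ** n_exp * data_tokens ** d_exp


# Placeholder scales: a 1B-parameter model on 100B tokens versus two variants.
lr_base = optimal_lr(1e9, 1e11)
lr_bigger_model = optimal_lr(1e10, 1e11)   # 10x parameters, same data
lr_more_data = optimal_lr(1e9, 1e12)       # same parameters, 10x data
```

With real fitted exponents, the two functions would be evaluated at the target model and data size to pick a starting point for the large run. Given the fragility noted above for the data trend, such predictions are better treated as a prior than as a guarantee.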
Optimizer advancements and scale dependency: The case of Muon
The lecture delves into the scale dependency of optimizers, highlighting Muon as a significant development. Muon treats matrix-valued parameters differently from vector-valued ones by operating on their spectral properties (singular values). Working on the spectrum lets the optimizer normalize or rescale each update according to the matrix's singular values, which can improve convergence speed and stability. The gains are clearest in small-scale benchmarks such as the nanoGPT speedrun, where Muon significantly outperformed Adam. Its effectiveness at large scale is still an active research area. Initial studies suggested that the gains diminish with scale, but its use in models such as Kimi K2 shows that it is viable at scale, even though direct comparisons to Adam at that scale are still needed.
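One way to act on a matrix's singular values without computing an SVD is a Newton-Schulz iteration, which pushes every singular value toward 1 and so approximately orthogonalizes the update. The sketch below uses the classic cubic iteration `X <- 1.5 X - 0.5 X X^T X` for clarity, which is a simplification and not necessarily the polynomial used in any given Muon implementation. The helper names, the momentum form, and the default `lr` and `beta` values are assumptions, and plain Python lists stand in for tensors.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def orthogonalize(g, steps=15):
    """Approximate the orthogonal polar factor of a nonzero, full-rank matrix."""
    norm = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / norm for v in row] for row in g]  # singular values now <= 1
    for _ in range(steps):
        # Cubic Newton-Schulz step: maps each singular value s to
        # 1.5*s - 0.5*s**3, which converges to 1 for s in (0, sqrt(3)).
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * xv - 0.5 * cv for xv, cv in zip(xr, cr)]
             for xr, cr in zip(x, xxt_x)]
    return x


def muon_like_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One illustrative update: accumulate momentum, orthogonalize, then step."""
    momentum = [[beta * m + g for m, g in zip(mr, gr)]
                for mr, gr in zip(momentum, grad)]
    update = orthogonalize(momentum)
    weight = [[w - lr * u for w, u in zip(wr, ur)]
              for wr, ur in zip(weight, update)]
    return weight, momentum
```

After orthogonalization every direction in the update has roughly the same magnitude, regardless of how skewed the gradient's singular values were. Vector-valued parameters have no such spectrum, which is why this treatment applies only to matrices and those parameters are typically left to a standard optimizer such as Adam.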
muP's conceptual framework for invariant scaling
The muP (maximal update parametrization) program offers a theoretical framework for achieving stable scaling. It rests on two core assertions: 1) activations at initialization should stay O(1) as network width changes, and 2) the change in activations after one gradient step should also be O(1), so that features are actually learned rather than frozen or blown up. These invariants lead to specific rules for initializing weights and setting per-layer learning rates, often expressed in terms of fan-in and fan-out. The goal is to prevent optimal learning rates and other hyperparameters from shifting dramatically as model size increases, which simplifies tuning. While muP has shown promise in stabilizing hyperparameters across scales and improving performance, it relies on several strong assumptions and can be sensitive to components not explicitly covered by its theory, such as certain optimizer variants or learned gain terms in normalization layers.
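The first invariant can be checked numerically for a single linear layer. The sketch below compares an initialization whose standard deviation scales as `1/sqrt(fan_in)` with one that uses a fixed standard deviation at every width. It is an illustration of the first assertion only (activation size at initialization), under the assumption of Gaussian inputs and weights. It does not test the second assertion about the update after a gradient step, and the helper name and constants are hypothetical.

```python
import math
import random


def activation_rms(fan_in, init_std, fan_out=256, seed=0):
    """RMS of y = W x for x ~ N(0, 1) and W entries ~ N(0, init_std**2)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]
    total = 0.0
    for _ in range(fan_out):
        # One output coordinate: dot product of a fresh weight row with x.
        y = sum(rng.gauss(0.0, init_std) * xj for xj in x)
        total += y * y
    return math.sqrt(total / fan_out)


widths = (64, 256, 1024)
# Width-aware init: per-coordinate activation size stays roughly constant.
scaled = [activation_rms(n, 1.0 / math.sqrt(n)) for n in widths]
# Fixed init std: activation size grows roughly like sqrt(width).
fixed = [activation_rms(n, 0.05) for n in widths]
```

With the width-aware initialization the three RMS values stay near 1, while with the fixed standard deviation they grow by roughly 2x for every 4x increase in width. That drift is the kind of scale dependence that forces the learning rate to be re-tuned at each model size.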
The art and science of scaling in practice
Scaling laws provide a scientific-sounding framework for guiding decisions about model architecture, hyperparameter tuning, and optimizer selection, but applying them is ultimately an art. Trends observed at small scale are not guaranteed to extrapolate to much larger ones. In practice one either uses techniques like muP for stability or fits scaling laws to predict optimal hyperparameters, and both are tools for managing 'hyperparameter drift' rather than definitive scientific predictions. Because scaling remains messy and full of unknowns, these methods raise the chances of success, but there is not yet a single 'silver bullet'. Continued research and experimentation are essential to better understand and master large-scale model training.
Common Questions
Key challenges include optimizing hyperparameters like learning rates, batch sizes, and initializations, which become highly scale-sensitive. It's also difficult to predict how small-scale experiment gains will transfer to large-scale training.
Mentioned in this video
A model that uses scaling experiments to tune hyperparameters like batch size and learning rate, following a similar approach to DeepSeek.
An open-source model that, after switching to sparsity, performs scaling law analyses to understand the relationship between activated parameters and training loss.
A model that conducted scaling experiments to compare different attention architectures (lightning, softmax, hybrid) and justify architectural decisions.
A standard optimizer that serves as a baseline for comparison with newer algorithms like Muon and is discussed in the context of hyperparameter tuning and scaling.
An optimizer that showed significant gains over Adam in small-scale experiments, particularly for the nanoGPT speedrun benchmark. It treats matrix-valued parameters differently.
A model with scaling loss analysis that shows the relationship between compute, log probability, and downstream accuracy, fitting a sigmoidal curve.
An optimizer mentioned as potentially breaking muP due to its reliance on the sign of the gradient.
Mentioned in the context of the classical scaling laws canon.
A high-performance small language model from the Chinese open-source community, used as a case study for scaling laws and special initializations like muP.
The original DeepSeek paper is discussed for its serious scaling analysis and its approach to fitting scaling laws for optimal batch size and learning rate.
A learning rate schedule strategy characterized by a warm-up phase, a long stable phase, and a rapid decay phase, allowing for easier continuation of training runs.
Re-analysis of Chinchilla scaling laws and token-to-model size ratio is discussed as a method used by MiniCPM.