What is MUP initialization and why is it important?

MUP (Maximal Update Parameterization, usually written μP) is a parameterization scheme designed to keep the optimal learning rate stable across different model scales. It adjusts initializations and per-layer learning rates so that hyperparameters depend less on scale.

What is the purpose of the Warm-up Stable Decay (WSD) learning rate schedule?

WSD is a learning rate schedule that includes a warm-up, a long stable phase, and a final decay. Its main advantage is allowing training runs to be easily restarted from stable checkpoints, making scaling experiments much more efficient without full retraining.

How does DeepSeek approach hyperparameter optimization for scaling?

DeepSeek directly fits scaling laws to estimate optimal batch size and learning rate. They conduct extensive grid searches at various scales to identify these optimal values and then use these fitted laws for larger models.

What new trends are emerging in scaling laws for LLMs in 2026?

Recent trends include analyzing the impact of sparsity in sparse models, replicating Chinchilla-like scaling laws, and investigating the relationship between loss and downstream accuracy. Many companies now consider core scaling machinery standard and detail it less.

How do batch size and learning rate trends differ across models and data sizes?

The optimal batch size appears to strongly depend on the total amount of data being trained on, often following a power law. Optimal learning rate, however, is influenced by both model size (larger models need smaller learning rates) and data size (more data may need higher learning rates).

What are the key differences between Muon and Adam optimizers?

Muon is designed to treat matrix-valued parameters (like those in attention or MLPs) differently from vector-valued ones by operating on their spectra. It uses a Newton-Schulz iteration to approximately orthogonalize updates. Its gains are clearest at small scales, and its benefit at large scales is still being studied.

What are the core invariants behind the MUP initialization strategy?

MUP is based on two main assertions: activations at initialization should remain roughly the same size regardless of network width, and the change in activations after one gradient step should be of a fixed magnitude (feature learning).

What are the practical implications of MUP for hyperparameter tuning?

MUP suggests specific scaling rules for initializations and learning rates on a per-layer basis, often related to fan-in and fan-out. This aims to stabilize learning rates and prevent them from shifting drastically as model size increases.

How difficult is it to reliably extrapolate scaling laws to very large models?

Extrapolating scaling laws is challenging and often involves an 'art' rather than pure science. Even seemingly robust trends can deviate significantly or 'blow up' at larger scales, requiring careful experimentation and consideration of various factors beyond simple power laws.

Key Moments

Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 11: Scaling Laws

Stanford Online

Education | 5 min read | 78 min video

May 19, 2026|145 views|19

TL;DR

Scaling neural networks requires careful hyperparameter tuning, with techniques like MUP and meticulous optimizer choices crucial for stable performance across different model sizes.

Key Insights

The MUP initialization technique aims to make the optimal learning rate stable across different model scales by adjusting initializations and learning rates per parameter.

Warm-up Stable Decay (WSD) learning rate schedules are a common and versatile technique that allows for easier continuation of training runs by stabilizing the learning rate for a significant portion of training.

DeepSeek's approach to hyperparameter scaling involves directly fitting scaling laws to estimate optimal batch sizes and learning rates through extensive grid searches across various scales.

Muon, an optimizer that treats matrices differently from vectors by operating on their spectral properties, shows significant gains at small scales but its effectiveness at large scales is still being actively researched.

The MUP program hypothesizes that key invariants for scaling are that activations at initialization remain constant, and after one gradient step, the change in activations is O(1), suggesting specific parameterization strategies.

Scaling laws, while powerful for guiding decisions on architectures, optimizers, and hyperparameters, are ultimately an art with no guaranteed silver bullet for extrapolation to all scales.

Stabilizing learning rates with MUP initialization

The lecture introduces MUP (Maximal Update Parameterization, or μP) as a technique to keep the optimal learning rate consistent as model scale changes. This is achieved by scaling embedding outputs and residual connections, setting matrix initializations based on fan-in and fan-out, and using per-parameter learning rates. The core idea is to alter initializations and learning rates together so that hyperparameter tuning becomes less scale-dependent. The MiniCPM paper, discussed as an early example, used MUP to obtain a stable optimal learning rate across model sizes. The approach aims to remove the need for learning rate tuning at each scale by fixing a single optimum, though the optimal batch size still varies with data and model size and must be handled separately.
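To make the per-layer rule concrete, here is a minimal sketch of one commonly cited MUP-style scaling for hidden weight matrices trained with Adam: the initialization standard deviation shrinks with the square root of the width multiplier, and the learning rate shrinks linearly with it. The function name, the base values, and the choice of this particular rule are assumptions for illustration, not the exact recipe from the lecture or from MiniCPM.

```python
import math


def mup_hidden_layer_settings(fan_in, base_fan_in, base_lr, base_std):
    """Hypothetical helper: scale init std and Adam LR for a hidden matrix.

    Assumed rule (one common MUP-style form for hidden weights under Adam):
      init_std ~ 1 / sqrt(fan_in),  lr ~ 1 / fan_in,
    both expressed relative to a small, already tuned base model.
    """
    width_mult = fan_in / base_fan_in
    return {
        "init_std": base_std / math.sqrt(width_mult),
        "lr": base_lr / width_mult,
    }


# Hyperparameters tuned at width 256 (illustrative values, not from the source).
small = mup_hidden_layer_settings(fan_in=256, base_fan_in=256, base_lr=1e-2, base_std=0.02)
# The same base hyperparameters transferred to a 16x wider layer.
large = mup_hidden_layer_settings(fan_in=4096, base_fan_in=256, base_lr=1e-2, base_std=0.02)
```

In this sketch only the base model's learning rate is tuned; wider layers derive their settings from the width multiplier rather than from a fresh sweep.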

Warm-up Stable Decay (WSD) for efficient learning rate scheduling

To address the challenge of setting learning rate schedules when model and data scales vary, the Warm-up Stable Decay (WSD) learning rate schedule is presented. This schedule features a warm-up phase, a long stable phase where the learning rate is constant, and a rapid decay phase to zero. A key advantage of WSD is that a run can be resumed from any checkpoint in the stable phase and then decayed, which significantly reduces the cost of experiments that scale the amount of data. This contrasts with cosine schedules, which require knowing the total training budget in advance. While WSD might slightly underperform cosine schedules in some cases, it offers comparable performance in many others and provides a versatile default learning rate strategy, making Chinchilla-style analysis and data scaling experiments much more manageable.
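The three phases can be sketched as a single function of the step index. This is an illustrative version that assumes linear warm-up and linear decay to zero; the source only specifies a warm-up, a constant phase, and a rapid decay, so the exact shapes and the function name `wsd_lr` are assumptions.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps):
    """Warm-up Stable Decay schedule (sketch; linear ramps are an assumption)."""
    if step < warmup_steps:
        # Warm-up: ramp linearly up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Stable phase: constant learning rate.
        return peak_lr
    # Decay: ramp linearly down to zero at total_steps.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / decay_steps


# Illustrative run: 1000 steps, 100 warm-up steps, 100 decay steps.
schedule = [wsd_lr(s, 1000, 1e-3, 100, 100) for s in range(1001)]
```

Because the stable phase does not depend on `total_steps`, a longer run can reuse a stable-phase checkpoint and only the decay segment needs to be rerun with a later `total_steps`.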

DeepSeek's approach to fitting scaling laws for hyperparameters

DeepSeek offers an alternative to stabilizing hyperparameters: directly fitting scaling laws to estimate the optimal batch size and learning rate. This involves running extensive grid searches at several scales to locate the best hyperparameters, then fitting lines to these optima as a function of compute. The optimal batch size was found to scale with non-embedding training FLOPs, and the optimal learning rate was fit against compute in the same way. While fitting lines to learning rates might seem less precise than other methods, the DeepSeek approach demonstrated its feasibility by achieving reasonable results. This method contrasts with MUP's goal of invariance: instead, it accepts that hyperparameters change with scale and captures that change in a predictive model.
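The "fit a line to the optima" step amounts to a least-squares fit in log-log space. The sketch below uses synthetic data generated from a made-up power law (coefficient 0.5, exponent 0.3); these numbers are placeholders for illustration and are not DeepSeek's fitted values.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = coef * x**exponent by ordinary least squares on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly)) / sum(
        (a - mean_x) ** 2 for a in lx
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope


# Compute budgets (FLOPs) of the small grid-search runs (illustrative).
compute = [1e17, 1e18, 1e19, 1e20]
# Synthetic "best batch size" at each budget, from an invented power law.
best_batch = [0.5 * c ** 0.3 for c in compute]

coef, exponent = fit_power_law(compute, best_batch)
# Extrapolate the fitted law to a larger, untested budget.
predicted_batch = coef * 1e22 ** exponent
```

With real grid-search results the optima are noisy, so the fitted exponent carries uncertainty that grows with the extrapolation distance.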

Scaling hyperparameter dependencies from StepFun research

Further investigation into hyperparameter scaling, particularly from the StepFun research, highlights the complex relationships between learning rate, batch size, and model/data scale. A key, perhaps counter-intuitive, finding is that optimal batch size appears to depend primarily on the total amount of training data, following a power-law relationship. Conversely, the optimal learning rate shows a more complex dependence, decreasing with larger models but increasing with more data, though this latter trend might be fragile. The research also suggests that while learning rates are generally robust hyperparameters, optimal batch sizes can shift more significantly with changes in training data, indicating potential contingencies in these scaling models.

Optimizer advancements and scale dependency: The case of Muon

The lecture delves into the scale dependency of optimizers, highlighting Muon as a significant development. Muon treats matrix-valued parameters differently from vector-valued ones by operating on their spectral properties (singular values). This matrix-specific approach allows for regularization or scaling of updates based on the matrix's spectrum, potentially leading to faster convergence and stability. The gains are clearest in small-scale benchmarks such as the nanoGPT speedrun, where it significantly outperformed Adam. However, the effectiveness of Muon at large scales is still an active research area. While initial studies suggested diminishing gains with scale, its use in models like Kimi K2 demonstrates its viability at scale, even if direct comparisons to Adam at that scale are still needed.
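Operating on the spectrum can be illustrated with the classic cubic Newton-Schulz iteration, which pushes every singular value of a matrix toward 1 without computing an SVD. This is a pure-Python sketch of that textbook iteration only; Muon implementations use a tuned polynomial and add momentum and other details that are not shown, and the helper names here are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps=30):
    """Approximate the nearest semi-orthogonal matrix to g (rows <= columns)."""
    # Normalize by the Frobenius norm so all singular values are at most 1,
    # which keeps the iteration inside its convergence region.
    fro = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / (fro + 1e-12) for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        # Cubic update X <- 1.5 X - 0.5 (X X^T) X drives singular values to 1.
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)] for ra, rb in zip(x, xxt_x)]
    return x


# A small 2x3 "gradient" matrix with unequal singular values (illustrative).
update = [[3.0, 1.0, 0.0], [1.0, 2.0, 1.0]]
ortho = newton_schulz_orthogonalize(update)
gram = matmul(ortho, transpose(ortho))  # close to the 2x2 identity
```

After the iteration, all directions of the update have roughly equal magnitude, which is the sense in which the update has been orthogonalized.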

MUP's conceptual framework for invariant scaling

The MUP (Maximal Update Parameterization) program offers a theoretical framework for achieving stable scaling. It is based on two core assertions: 1) activations at initialization should stay roughly the same size across different network widths, and 2) the change in activations after one gradient step should be O(1) (feature learning). These invariants lead to specific rules for initializing weights and setting per-layer learning rates, often expressed in terms of fan-in and fan-out. The goal is to prevent optimal learning rates and other hyperparameters from shifting dramatically as the model size increases, simplifying the tuning process. While MUP has shown promise in stabilizing scaling laws and improving performance, it relies on several strong assumptions and can be sensitive to components not explicitly accounted for in its theory, such as certain optimizer variants or learned gain terms in normalization layers.
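The first invariant can be checked numerically for a single linear layer. The sketch below assumes unit-variance Gaussian inputs and Gaussian weights, and compares a fan-in-scaled initialization (standard deviation 1/sqrt(width)) against an unscaled one; it illustrates the activation-size argument only and does not cover the second invariant about the update after a gradient step.

```python
import math
import random


def activation_rms(width, scaled=True, seed=0):
    """RMS of the outputs of one random width x width linear layer at init."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    # Fan-in scaling keeps each output's variance near 1 regardless of width.
    std = 1.0 / math.sqrt(width) if scaled else 1.0
    outputs = [sum(rng.gauss(0.0, std) * xi for xi in x) for _ in range(width)]
    return math.sqrt(sum(v * v for v in outputs) / width)


narrow = activation_rms(64)                  # stays near 1
wide = activation_rms(256)                   # also stays near 1
wide_unscaled = activation_rms(256, scaled=False)  # grows like sqrt(width)
```

With the scaled initialization the activation size is roughly independent of width, while the unscaled version grows with the square root of the width.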

The art and science of scaling in practice

Scaling laws, while providing a scientific-sounding framework for guiding decisions in model architecture, hyperparameter tuning, and optimizer selection, are ultimately an art. The ability to extrapolate trends observed at smaller scales to much larger ones is not guaranteed. Practical application involves using techniques like MUP for stability or fitting scaling laws to predict optimal parameters, but these are tools to manage 'hyperparameter drift' rather than definitive scientific predictions. The inherent messiness and unknowns in scaling mean that while these methods increase the chances of success, there is no single 'silver bullet' solution yet. Continued research and experimentation are essential to better understand and master the complex domain of large-scale model training.

Common Questions

Key challenges include optimizing hyperparameters like learning rates, batch sizes, and initializations, which become highly scale-sensitive. It's also difficult to predict how small-scale experiment gains will transfer to large-scale training.

Topics

AI & Machine Learning Technology & Innovation Deep Learning Scaling Laws LLM Architecture Model Scaling Hyperparameter Optimization Language Model Training Optimizer Selection

Mentioned in this video

Software & Apps

Qwen 2.5

A model that uses scaling experiments to tune hyperparameters like batch size and learning rate, following a similar approach to DeepSeek.

Hunyuan

An open-source model that, after switching to sparsity, performs scaling law analyses to understand the relationship between activated parameters and training loss.

MiniMax-01

A model that conducted scaling experiments to compare different attention architectures (lightning, softmax, hybrid) and justify architectural decisions.

Adam

A standard optimizer that serves as a baseline for comparison with newer algorithms like Muon and is discussed in the context of hyperparameter tuning and scaling.

Muon

An optimizer that showed significant gains over Adam in small-scale experiments, particularly for the nanoGPT speedrun benchmark. It treats matrix-valued parameters differently.

Llama 3

A model with scaling loss analysis that shows the relationship between compute, log probability, and downstream accuracy, fitting a sigmoidal curve.

Lion

An optimizer mentioned as potentially breaking MUP due to its reliance on the sign of the gradient.

Books

Kaplan

Mentioned in the context of the classical scaling laws canon.

MiniCPM

A high-performance small language model from the Chinese open-source community, used as a case study for scaling laws and special initializations like MUP.

DeepSeek

The original DeepSeek paper is discussed for its serious scaling analysis and its approach to fitting scaling laws for optimal batch size and learning rate.

People

Greg Yang

The originator of the MUP research program, credited with a series of papers on tensor programs.

Jeremy Bernstein

A researcher who has worked on Muon and proposed ideas about layer-specific learning rates and optimizers.

Concepts

Warm-up Stable Decay

A learning rate schedule strategy characterized by a warm-up phase, a long stable phase, and a rapid decay phase, allowing for easier continuation of training runs.

Chinchilla

Re-analysis of Chinchilla scaling laws and token-to-model size ratio is discussed as a method used by MiniCPM.

Products

Kimi K2

A recent model discussed for its focus on scaling laws, including sparsity and its training with the Muon optimizer.

Organizations

StepFun

Researchers who conducted a large-scale hyperparameter tuning study, drawing from a recent preprint to analyze learning rates, batch size, and their impact on downstream losses.

Companies

Cerebras

A chip company with a language model training arm that has published multiple papers on MUP scaling of language models.

OpenAI

Mentioned for their scaling law paper from Kaplan, which proposed scaling batch size as a function of terminal loss.
