Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 11: Scaling Laws

Stanford Online
Education | 5 min read | 78 min video
May 19, 2026 | 145 views | 19
TL;DR

Scaling neural networks requires careful hyperparameter tuning, with techniques like MUP and meticulous optimizer choices crucial for stable performance across different model sizes.

Key Insights

1. The MUP initialization technique aims to keep the optimal learning rate stable across model scales by adjusting initializations and learning rates per parameter.

2. Warm-up Stable Decay (WSD) learning rate schedules are a common and versatile technique. Because the learning rate is held constant for most of training, a run can be continued from a stable-phase checkpoint without starting over.

3. DeepSeek's approach to hyperparameter scaling directly fits scaling laws to estimate optimal batch sizes and learning rates, using extensive grid searches across several scales.

4. Muon, an optimizer that treats matrices differently from vectors by operating on their spectral properties, shows significant gains at small scales, but its effectiveness at large scales is still being actively researched.

5. The MUP program hypothesizes two key invariants for scaling: activations at initialization stay constant in scale as width grows, and the change in activations after one gradient step is O(1). These invariants imply specific parameterization strategies.

6. Scaling laws are powerful for guiding decisions on architectures, optimizers, and hyperparameters, but they remain an art, with no guaranteed silver bullet for extrapolation to all scales.

Stabilizing learning rates with MUP initialization

The lecture introduces MUP (μP, the Maximal Update Parametrization) as a technique for keeping the optimal learning rate consistent as model scale changes. This is achieved by carefully scaling embedding outputs and residual connections, setting matrix initializations based on fan-out ratios, and using per-parameter learning rates. The core idea is to alter initializations and learning rates so that learning behaves the same way at every width, making hyperparameter tuning less scale-dependent. The MiniCPM paper, discussed as an early example, used MUP to obtain a stable optimal learning rate across a range of model sizes. The approach aims to remove the need for extensive learning rate tuning at each scale by tuning once and reusing the optimum, though the optimal batch size still varies with data and model size and requires separate consideration.
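As a rough illustration of what "per-parameter" scaling can look like, the sketch below rescales the initialization and learning rate of a hidden weight matrix when the width grows from a tuned base width. It assumes one commonly cited Adam-style recipe (init standard deviation shrinking as 1/sqrt(width multiplier), learning rate shrinking as 1/width multiplier); the function name, arguments, and numbers are hypothetical, and the lecture's exact per-layer table may differ.

```python
import math


def mup_hidden_scaling(base_width, width, base_lr, base_std):
    """Rescale init std and learning rate for a hidden weight matrix.

    Assumption (not taken from the lecture): an Adam-style recipe where
    std scales as 1/sqrt(m) and lr scales as 1/m, with m = width / base_width.
    """
    m = width / base_width  # width multiplier relative to the tuned base model
    std = base_std / math.sqrt(m)
    lr = base_lr / m
    return std, lr


# Hypothetical values: tune at width 256, then transfer to width 1024.
std, lr = mup_hidden_scaling(base_width=256, width=1024, base_lr=1e-2, base_std=0.02)
```

Under this assumption the learning rate tuned at the small width is reused at the large width after a deterministic rescaling, rather than being searched again.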

Warm-up Stable Decay (WSD) for efficient learning rate scheduling

To address the challenge of setting learning rate schedules when model and data scales vary, the lecture presents the Warm-up Stable Decay (WSD) schedule. It consists of a warm-up phase, a long stable phase in which the learning rate is constant, and a rapid decay phase to zero. A key advantage of WSD is that a run can be resumed from any checkpoint in the stable phase and then decayed, so one long run yields results for many data budgets. This significantly reduces the cost of experiments that scale data. Cosine schedules, by contrast, require knowing the total training budget in advance. WSD may slightly underperform cosine schedules in some cases, but it performs comparably in many others and is a versatile default that makes Chinchilla-style analysis and data scaling experiments much more manageable.
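The three phases can be sketched as a small function of the step index. This is a minimal version assuming linear warm-up and linear decay to zero; the summary does not say which decay shape the lecture uses, and the function and parameter names are illustrative.

```python
def wsd_lr(step, peak_lr, warmup_steps, stable_steps, decay_steps):
    """Warm-up Stable Decay learning rate at a given step (0-indexed).

    Assumptions: linear warm-up, constant plateau, linear decay to zero.
    decay_steps must be positive.
    """
    if step < warmup_steps:
        # Warm-up: ramp linearly up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        # Stable: hold the learning rate constant.
        return peak_lr
    # Decay: ramp linearly down to zero, then stay at zero.
    decay_step = step - warmup_steps - stable_steps
    remaining = max(0.0, 1.0 - decay_step / decay_steps)
    return peak_lr * remaining


# Hypothetical 100-step run: 10 warm-up, 80 stable, 10 decay.
schedule = [wsd_lr(s, 1.0, 10, 80, 10) for s in range(101)]
```

To evaluate a shorter data budget, one would load a checkpoint from the plateau and run only the decay phase from there, which is what makes the stable phase reusable.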

DeepSeek's approach to fitting scaling laws for hyperparameters

DeepSeek offers an alternative to stabilizing hyperparameters: directly fitting scaling laws to estimate the optimal batch size and learning rate. This involves running extensive grid searches at several scales to locate the optimal hyperparameters at each one, then fitting lines to these optima as a function of compute. The optimal batch size was found to scale with non-embedding training FLOPs, and the optimal learning rate was fit against compute in the same way. Fitting a line to the learning rate optima may look less precise than other methods, but DeepSeek showed the approach is feasible by obtaining reasonable results. This method contrasts with MUP's goal of invariance: it accepts that hyperparameters change with scale and captures that change in a predictive model.
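Fitting a line to grid-search optima is ordinary least squares in log-log space. The sketch below shows the mechanics on synthetic data generated from a known power law; the compute values, coefficient, and exponent are made up for illustration and are not DeepSeek's reported numbers.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y ~ coeff * x**exponent, done in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    exponent = sxy / sxx                      # slope of the log-log line
    coeff = math.exp(mean_y - exponent * mean_x)  # intercept, mapped back
    return coeff, exponent


# Synthetic "grid-search optima": best batch size at each compute budget.
compute = [1e17, 1e18, 1e19, 1e20]
best_batch = [0.5 * c ** 0.3 for c in compute]

coeff, exponent = fit_power_law(compute, best_batch)
predicted_batch = coeff * (1e22 ** exponent)  # extrapolate to a larger budget
```

In practice the optima from a grid search are noisy, so the fitted line is only as trustworthy as the spread of points it is drawn through.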

Scaling hyperparameter dependencies from StepFun research

Further investigation into hyperparameter scaling, particularly from the StepFun research, highlights the complex relationships between learning rate, batch size, and model/data scale. A key, perhaps counter-intuitive, finding is that optimal batch size appears to depend primarily on the total amount of training data, following a power-law relationship. Conversely, the optimal learning rate shows a more complex dependence, decreasing with larger models but increasing with more data, though this latter trend might be fragile. The research also suggests that while learning rates are generally robust hyperparameters, optimal batch sizes can shift more significantly with changes in training data, indicating potential contingencies in these scaling models.
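The qualitative trends above can be written compactly. The functional form for the learning rate is an assumed power-law product chosen only to match the stated directions (down with model size, up with data); the constants and exponents are placeholders, not fitted values from the StepFun work. Here N is the number of model parameters and D the number of training tokens.

```latex
B_{\mathrm{opt}}(D) \approx c_B \, D^{\beta},
\qquad
\eta_{\mathrm{opt}}(N, D) \approx c_\eta \, N^{-a} \, D^{\,b},
\quad a > 0,\; b > 0 .
```

The first relation says batch size is set by the data budget alone; the second says the learning rate depends on both N and D, with the dependence on D being the part flagged as potentially fragile.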

Optimizer advancements and scale dependency: The case of Muon

The lecture then turns to the scale dependency of optimizers, highlighting Muon as a significant development. Muon treats matrix-valued parameters differently from vector-valued ones by operating on their spectral properties (singular values). This matrix-specific approach regularizes or rescales updates based on the matrix's spectrum, which can lead to faster convergence and better stability. The gains are clearest in small-scale benchmarks such as the nanoGPT speedrun, where Muon significantly outperformed Adam. Its effectiveness at large scales is still an active research area. Initial studies suggested diminishing gains with scale, but its successful use in models like Kimi K2 demonstrates that it is viable at scale, even if direct comparisons to Adam at that scale are still needed.
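One way to act on an update's singular values without computing an SVD is a Newton-Schulz iteration that pushes every singular value toward 1, which approximately orthogonalizes the update. The sketch below uses the classic cubic iteration in pure Python for readability. This is a simplification: Muon implementations typically use a tuned higher-order polynomial and apply it to a momentum buffer, and the function names here are illustrative.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def orthogonalize(update, num_iters=20):
    """Approximate the nearest (semi-)orthogonal matrix to `update`.

    Cubic Newton-Schulz: X <- 1.5 X - 0.5 X X^T X. Dividing by the
    Frobenius norm first puts all singular values in (0, 1], where each
    iteration moves them toward 1.
    """
    norm = math.sqrt(sum(v * v for row in update for v in row))
    if norm == 0.0:
        return [list(row) for row in update]
    x = [[v / norm for v in row] for row in update]
    for _ in range(num_iters):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxt_x[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x


# A toy 2x2 "gradient" with very unequal singular values.
ortho = orthogonalize([[3.0, 0.0], [0.0, 0.5]])
```

After orthogonalization, the weakly represented direction receives an update of the same size as the dominant one, which is the sense in which the update depends on the matrix's spectrum rather than on individual entries.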

MUP's conceptual framework for invariant scaling

The MUP (μP, Maximal Update Parametrization) program offers a theoretical framework for achieving stable scaling. It rests on two core assertions: 1) activations at initialization should stay constant in scale across network widths, and 2) the change in activations after one gradient step should be O(1), so that features are actually learned. These invariants lead to specific rules for initializing weights and setting per-layer learning rates, often expressed in terms of fan-in and fan-out ratios. The goal is to prevent optimal learning rates and other hyperparameters from shifting dramatically as model size increases, which simplifies tuning. While MUP has shown promise in stabilizing scaling behavior and improving performance, it relies on several strong assumptions and can be sensitive to components its theory does not explicitly cover, such as certain optimizer variants or learned gain terms in normalization layers.
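To see how the first invariant turns into an initialization rule, consider a single linear layer under simplifying assumptions that are not taken from the lecture: h = Wx with W of shape n_out by n_in, entries drawn independently with mean zero and variance sigma squared, and an input x whose coordinates are of constant size.

```latex
h_i = \sum_{j=1}^{n_{\mathrm{in}}} W_{ij}\, x_j
\quad\Longrightarrow\quad
\operatorname{Var}(h_i) = \sigma^2 \sum_{j=1}^{n_{\mathrm{in}}} x_j^2
= \sigma^2 \, n_{\mathrm{in}} \cdot \frac{\lVert x \rVert^2}{n_{\mathrm{in}}} .
```

If the coordinates of x are of constant size, then the mean square of x is of constant size too, so Var(h_i) stays constant as width grows exactly when sigma squared is proportional to 1/n_in. The second invariant is handled the same way, by choosing the per-layer learning rate so that the change in h after one step neither vanishes nor blows up with width.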

The art and science of scaling in practice

Scaling laws, while providing a scientific-sounding framework for guiding decisions in model architecture, hyperparameter tuning, and optimizer selection, are ultimately an art. The ability to extrapolate trends observed at smaller scales to much larger ones is not guaranteed. Practical application involves using techniques like MUP for stability or fitting scaling laws to predict optimal parameters, but these are tools to manage 'hyperparameter drift' rather than definitive scientific predictions. The inherent messiness and unknowns in scaling mean that while these methods increase the chances of success, there is no single 'silver bullet' solution yet. Continued research and experimentation are essential to better understand and master the complex domain of large-scale model training.

Common Questions

Key challenges include optimizing hyperparameters like learning rates, batch sizes, and initializations, which become highly scale-sensitive. It's also difficult to predict how small-scale experiment gains will transfer to large-scale training.
