
Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 9: Scaling Laws

Stanford Online
Apr 30, 2026
TL;DR

Language model performance scales predictably with parameters and data, but published estimates of the optimal parameter-to-data ratio differ significantly, which affects training efficiency and model design choices.

Key Insights

1. Scaling laws, observed as linear trends on log-log plots, show that model performance (loss) predictably improves with increased compute, data, or parameters.

2. Research as early as the 1990s, by Vapnik and others, explored data scaling laws to estimate model performance without training on massive datasets.

3. Data scaling laws often exhibit polynomial decay in error, with exponents around -0.1, suggesting neural networks learn at a rate comparable to non-parametric regressors in roughly 10 dimensions.

4. The critical batch size, a rule of thumb for the largest batch size that still yields near-linear returns, grows as the target loss decreases (i.e., with more compute), following a power-law relationship.

5. The Chinchilla paper's findings, suggesting a roughly 20:1 token-to-parameter ratio, differ from Kaplan's earlier work (which favored far more parameters relative to data), highlighting the sensitivity of scaling laws to details like parameter counting and learning rate schedules.

6. For production models, overtraining (more tokens per parameter than the Chinchilla ratio) is often preferred to minimize serving costs, even if it means higher training compute.

What are scaling laws and why are they crucial?

Scaling laws are predictive rules that extrapolate small-scale model performance and behavior to large-scale scenarios, acting as a paradigm for designing and optimizing language models. Instead of wasteful large-scale tuning, scaling laws allow for optimization at small scales, providing confidence through simple, robust connections between small and large model behaviors. These laws reveal predictable relationships between resources (compute, data, parameters) and model performance (typically, test loss), often appearing as linear trends on log-log plots, which signify power-law or polynomial relationships. This approach is essential for making informed decisions when deploying potentially millions of dollars worth of compute for training large models.
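Concretely, a straight line on a log-log plot corresponds to a power law. With L the test loss, n the scaled resource (data, parameters, or compute), and a, α fitted constants (generic notation chosen here for illustration):

```latex
L(n) \approx a \, n^{-\alpha}
\quad\Longleftrightarrow\quad
\log L(n) \approx \log a - \alpha \log n
```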

Historical roots of scaling laws

The idea of scaling laws is not new and has roots in classical machine learning. Early work by Vapnik and colleagues (1993) explored how dataset size affects classifier performance, using smaller datasets to estimate future performance, a concept akin to a data scaling law. Canonical NLP work by Banko and Brill demonstrated that increasing data size predictably improves system performance. Later, Kolachina et al. (2012) identified power-law functional forms for scaling laws in machine translation as a function of data. A significant contribution was made by Hestness et al. (2017), who demonstrated consistent polynomial scaling trends for various systems (speech, translation, language models) as a function of dataset size, even anticipating phenomena like emergence and the importance of compute, predating modern LLM scaling trends.

Data scaling laws: The predictable core

Data scaling laws describe the relationship between the amount of training data and model performance while keeping other factors fixed. Empirically, as the data set size increases, the test loss of a sufficiently large model typically decreases in a predictable, often linear, fashion on a log-log plot. This linearity indicates a power-law or polynomial decay in error. For instance, in a simple estimation problem like calculating the mean of a Gaussian, the error decays as 1/n, which yields a linear relationship on a log-log plot. While classical parametric estimation shows a 1/n scaling (slope of -1), neural network scaling laws often exhibit slower convergence rates, with exponents around -0.1. This has led some researchers to theorize that neural networks behave similarly to non-parametric regressors in higher dimensions (e.g., 10 dimensions), where error scales as n^(-1/d).
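For concreteness, a minimal sketch of fitting such a power law by linear regression in log-log space; the numbers below are synthetic, chosen only to illustrate the procedure:

```python
import numpy as np

# Synthetic (illustrative) measurements: dataset sizes and test losses from
# small-scale runs; in practice these would come from real training runs.
n = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
loss = np.array([4.8, 4.3, 3.9, 3.5, 3.2])

# Fit log(loss) = log(a) - alpha * log(n), i.e. loss ≈ a * n^(-alpha).
slope, intercept = np.polyfit(np.log(n), np.log(loss), deg=1)
alpha, a = -slope, np.exp(intercept)

# Extrapolate to a dataset larger than anything actually trained on.
predicted = a * (1e9) ** (-alpha)
print(f"alpha ≈ {alpha:.3f}, predicted loss at 1e9 examples ≈ {predicted:.2f}")
```

The fitted exponent for these illustrative numbers comes out near 0.09, in line with the roughly -0.1 slopes mentioned above.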

Extending scaling laws to data mixture and repetition

Beyond simply increasing data volume, scaling laws can inform data mixture strategies and the effects of data repetition. By training small models on different data mixtures (e.g., news vs. Wikipedia), one can fit curves to predict the performance of larger models with varying data compositions, as sketched below. Although real-world data mixture optimization can be noisier than idealized scaling law predictions, the principle holds: if the scaling-law slopes are consistent across mixtures, the mixture that is optimal at small scale tends to remain optimal at large scale. Regarding data repetition, studies show that up to roughly four epochs performance does not significantly degrade; beyond that point, actual scaling curves fall below what would be projected if fresh data were available. This calls for careful consideration of data reuse, especially as compute grows faster than the supply of fresh data.
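A minimal sketch of the mixture-extrapolation idea; the mixture names and loss values below are hypothetical:

```python
import numpy as np

def fit_power_law(n, loss):
    """Fit loss ≈ a * n^(-alpha) in log-log space; returns (a, alpha)."""
    slope, intercept = np.polyfit(np.log(n), np.log(loss), deg=1)
    return np.exp(intercept), -slope

# Hypothetical small-scale losses for two candidate data mixtures.
n = np.array([1e6, 1e7, 1e8])
mixtures = {
    "70% web / 30% wiki": np.array([4.6, 3.9, 3.3]),
    "50% web / 50% wiki": np.array([4.7, 4.0, 3.5]),
}

# Extrapolate each fitted curve to the large-scale data budget we care about.
target_n = 1e10
for name, losses in mixtures.items():
    a, alpha = fit_power_law(n, losses)
    print(f"{name}: predicted loss at {target_n:.0e} examples ≈ {a * target_n ** -alpha:.2f}")
```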

Architecture and optimizer scaling: Engineering choices

Scaling laws provide a framework for evaluating architectural and optimizer choices without massive training runs. By training smaller variants and observing performance trends across compute ranges, one can compare, for example, Transformers versus LSTMs or different optimizers like SGD versus Adam. Typically, architecture and optimizer interventions show similar slopes in their scaling laws but differ in intercepts: the rate of improvement is about the same, but one approach starts from a better or worse baseline. Papers like Hestness et al. and Google's work on T5-style models show how scaling trends reveal the effectiveness of architectural components (e.g., gated linear units over Performers), and that even significant interventions like switching from Adam to SGD often produce parallel shifts rather than changes in the scaling slopes.

Hyperparameter tuning and the critical batch size

Key hyperparameters like batch size and learning rate require careful tuning, especially at scale, and scaling laws offer guidance. The concept of the 'critical batch size' says that increasing batch size yields nearly perfect returns up to a certain point (the noise-limited regime) before diminishing returns set in (the bias-limited regime). This critical batch size, estimated by balancing the number of optimization steps against the number of examples needed to reach a target loss, itself scales with compute: as compute increases and the target loss drops, the critical batch size grows, following a power-law relation (one common formalization is sketched below). Learning rate scaling matters as well; a common rule of thumb is to decrease the learning rate inversely with model width. More advanced techniques like µP (Maximal Update Parametrization) aim to keep the optimal learning rate stable across scales by reparameterizing the model, offering an alternative to hand-tuned scaling heuristics.
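One common formalization of this trade-off comes from the large-batch training literature (McCandlish et al., 2018); whether the lecture uses exactly this form is an assumption here. S is the number of optimization steps and E the number of examples processed to reach a target loss, with S_min and E_min their minimum achievable values:

```latex
\left(\frac{S}{S_{\min}} - 1\right)\left(\frac{E}{E_{\min}} - 1\right) = 1,
\qquad
B_{\mathrm{crit}} \approx \frac{E_{\min}}{S_{\min}}
```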

The Kaplan vs. Chinchilla debate on optimal scaling

A significant point of contention in scaling laws arose between Kaplan et al. (2020) and Hoffmann et al. (Chinchilla, 2022) regarding the optimal balance between model size and data for a given compute budget. Kaplan's work suggested training much larger models on less data (e.g., ~3x more parameters than tokens), driven by their observations and parameter-counting choices (excluding embeddings and the last layer). In contrast, Chinchilla advocated a more balanced approach, suggesting a roughly 20:1 token-to-parameter ratio, using more robust fitting methods (such as isoflops and the lower envelope) and more careful parameter counting. The discrepancies were later attributed to subtle but impactful implementation details, such as how parameters are counted (including or excluding embeddings and the softmax layer) and learning rate warm-up schedules, demonstrating that scaling laws are engineered and highly sensitive to the experimental setup.
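A back-of-the-envelope sketch of the Chinchilla-style split, assuming the common C ≈ 6·N·D approximation for training FLOPs and the 20:1 token-to-parameter rule of thumb; the helper name is ours, for illustration:

```python
def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into parameters and tokens under C ≈ 6*N*D and D ≈ r*N."""
    # C = 6 * N * D with D = r * N  =>  N = sqrt(C / (6 * r)), D = r * N
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a 1e23 FLOP budget (purely illustrative)
n, d = chinchilla_optimal(1e23)
print(f"~{n / 1e9:.1f}B parameters trained on ~{d / 1e9:.0f}B tokens")
```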

Practical implications: Transfer, overtraining, and isoflops

While scaling laws predict upstream performance (like perplexity) very reliably, transferring these gains to downstream tasks is less certain and requires careful consideration. For production models, the goal often shifts from minimizing training compute (as suggested by Chinchilla's ratio) to optimizing inference and serving costs. This frequently leads to 'overtraining' (more tokens per parameter than the compute-optimal ratio), yielding smaller models that are cheaper to serve while remaining highly capable. The 'isoflops' method, where a fixed compute budget is swept across other degrees of freedom (like the parameter-to-data ratio), has endured as a robust research tool; a sketch of one such sweep appears below. It allows for a systematic exploration of the performance surface, providing reliable insights into trade-offs and informing decisions for both research and practical model development.
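A minimal sketch of one isoflops slice, assuming the C ≈ 6·N·D FLOP approximation; the loss surface below is a toy stand-in (a Chinchilla-style functional form with made-up constants), not a fitted result:

```python
import numpy as np

def toy_loss(n_params, n_tokens):
    """Illustrative loss surface with the Chinchilla-style functional form
    L(N, D) = E + A / N**alpha + B / D**beta; constants here are made up."""
    return 2.0 + 400.0 / n_params**0.34 + 400.0 / n_tokens**0.28

compute = 1e21                        # fixed FLOP budget for one isoflop slice
n_grid = np.logspace(8, 10.5, 40)     # candidate model sizes to sweep
d_grid = compute / (6.0 * n_grid)     # tokens implied by C ≈ 6*N*D

losses = toy_loss(n_grid, d_grid)
best = int(np.argmin(losses))
print(f"isoflop minimum near N ≈ {n_grid[best]:.2e} params, D ≈ {d_grid[best]:.2e} tokens")
```

Repeating this sweep at several compute budgets traces out how the optimal (N, D) pair scales, which is exactly how the compute-optimal ratios discussed above are estimated.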

Common Questions

What are scaling laws?

Scaling laws are predictive rules that describe how a model's performance (like test loss) changes as resources (such as compute, model size, or dataset size) are increased. They allow researchers to extrapolate behaviors from small-scale experiments to large-scale systems, optimizing training cost and efficiency.
