
Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 9: Scaling Laws

Stanford Online
Apr 30, 2026
TL;DR

Language model performance scales predictably with parameters and data, but published estimates of the optimal parameter-to-data ratio differ significantly, which affects training efficiency and model design choices.

Key Insights

1. Scaling laws, observed as linear trends on log-log plots, show that model performance (loss) predictably improves with increased compute, data, or parameters.

2. Research as early as the 1990s, by Vapnik and others, explored data scaling laws to estimate model performance without training on massive datasets.

3. Data scaling laws often exhibit polynomial decay in error, with exponents around -0.1, suggesting neural networks learn at a rate comparable to non-parametric regressors in roughly 10 dimensions.

4. The critical batch size, a rule of thumb for the largest batch size that still yields near-linear returns, grows as the target loss decreases (i.e., with more compute), following a power-law relationship.

5. The Chinchilla paper's findings, suggesting a roughly 20:1 token-to-parameter ratio, differ from Kaplan's earlier work (which favored far more parameters relative to data), highlighting the sensitivity of scaling laws to details like parameter counting and learning rate schedules.

6. For production models, overtraining (more tokens per parameter than the Chinchilla ratio) is often preferred to minimize serving costs, even if it means higher training compute.

What are scaling laws and why are they crucial?

Scaling laws are predictive rules that extrapolate small-scale model performance and behavior to large-scale scenarios, acting as a paradigm for designing and optimizing language models. Instead of wasteful large-scale tuning, scaling laws allow for optimization at small scales, providing confidence through simple, robust connections between small and large model behaviors. These laws reveal predictable relationships between resources (compute, data, parameters) and model performance (typically, test loss), often appearing as linear trends on log-log plots, which signify power-law or polynomial relationships. This approach is essential for making informed decisions when deploying potentially millions of dollars worth of compute for training large models.
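Concretely, a straight line on a log-log plot corresponds to a power law. With L the test loss, n the scaled resource (data, parameters, or compute), and a, α fitted constants (generic notation chosen here for illustration):

```latex
L(n) \approx a \, n^{-\alpha}
\quad\Longleftrightarrow\quad
\log L(n) \approx \log a - \alpha \log n
```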

Historical roots of scaling laws

The idea of scaling laws is not new and has roots in classical machine learning. Early work by Vapnik and colleagues (1993) explored how dataset size affects classifier performance, using smaller datasets to estimate future performance, a concept akin to a data scaling law. Canonical NLP work by Banko and Brill demonstrated that increasing data size predictably improves system performance. Later, Kolachina et al. (2012) identified power-law functional forms for scaling laws in machine translation as a function of data. A significant contribution was made by Hestness et al. (2017), who demonstrated consistent polynomial scaling trends for various systems (speech, translation, language models) as a function of dataset size, even anticipating phenomena like emergence and the importance of compute, predating modern LLM scaling trends.

Data scaling laws: The predictable core

Data scaling laws describe the relationship between the amount of training data and model performance while keeping other factors fixed. Empirically, as the data set size increases, the test loss of a sufficiently large model typically decreases in a predictable, often linear, fashion on a log-log plot. This linearity indicates a power-law or polynomial decay in error. For instance, in a simple estimation problem like calculating the mean of a Gaussian, the error decays as 1/n, which yields a linear relationship on a log-log plot. While classical parametric estimation shows a 1/n scaling (slope of -1), neural network scaling laws often exhibit slower convergence rates, with exponents around -0.1. This has led some researchers to theorize that neural networks behave similarly to non-parametric regressors in higher dimensions (e.g., 10 dimensions), where error scales as n^(-1/d).
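For concreteness, a minimal sketch of fitting such a power law by linear regression in log-log space; the numbers below are synthetic, chosen only to illustrate the procedure:

```python
import numpy as np

# Synthetic (illustrative) measurements: dataset sizes and test losses from
# small-scale runs; in practice these would come from real training runs.
n = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
loss = np.array([4.8, 4.3, 3.9, 3.5, 3.2])

# Fit log(loss) = log(a) - alpha * log(n), i.e. loss ≈ a * n^(-alpha).
slope, intercept = np.polyfit(np.log(n), np.log(loss), deg=1)
alpha, a = -slope, np.exp(intercept)

# Extrapolate to a dataset larger than anything actually trained on.
predicted = a * (1e9) ** (-alpha)
print(f"alpha ≈ {alpha:.3f}, predicted loss at 1e9 examples ≈ {predicted:.2f}")
```

The fitted exponent for these illustrative numbers comes out near 0.09, in line with the roughly -0.1 slopes mentioned above.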

Extending scaling laws to data mixture and repetition

Beyond simply increasing data volume, scaling laws can inform data mixture strategies and the effects of data repetition. By training small models on different data mixtures (e.g., news vs. Wikipedia), one can fit curves to predict the performance of larger models with varying data compositions, as sketched below. Although real-world data mixture optimization can be noisier than idealized scaling law predictions, the principle holds: if the scaling-law slopes are consistent across mixtures, the mixture that is optimal at small scale tends to remain optimal at large scale. Regarding data repetition, studies show that up to roughly four epochs performance does not significantly degrade; beyond that point, actual scaling curves fall below what would be projected if fresh data were available. This calls for careful consideration of data reuse, especially as compute grows faster than the supply of fresh data.
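A minimal sketch of the mixture-extrapolation idea; the mixture names and loss values below are hypothetical:

```python
import numpy as np

def fit_power_law(n, loss):
    """Fit loss ≈ a * n^(-alpha) in log-log space; returns (a, alpha)."""
    slope, intercept = np.polyfit(np.log(n), np.log(loss), deg=1)
    return np.exp(intercept), -slope

# Hypothetical small-scale losses for two candidate data mixtures.
n = np.array([1e6, 1e7, 1e8])
mixtures = {
    "70% web / 30% wiki": np.array([4.6, 3.9, 3.3]),
    "50% web / 50% wiki": np.array([4.7, 4.0, 3.5]),
}

# Extrapolate each fitted curve to the large-scale data budget we care about.
target_n = 1e10
for name, losses in mixtures.items():
    a, alpha = fit_power_law(n, losses)
    print(f"{name}: predicted loss at {target_n:.0e} examples ≈ {a * target_n ** -alpha:.2f}")
```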

Architecture and optimizer scaling: Engineering choices

Scaling laws provide a framework for evaluating architectural and optimizer choices without massive training runs. By training smaller variants and observing performance trends across compute ranges, one can compare, for example, Transformers versus LSTMs or different optimizers like SGD versus Adam. Typically, architecture and optimizer interventions show similar slopes in their scaling laws but differ in intercepts: the rate of improvement is about the same, but one approach starts from a better or worse baseline. Papers like Hestness et al. and Google's work on T5-style models show how scaling trends reveal the effectiveness of architectural components (e.g., gated linear units over Performers), and that even significant interventions like switching from Adam to SGD often produce parallel shifts rather than changes in the scaling slopes.

Hyperparameter tuning and the critical batch size

Key hyperparameters like batch size and learning rate require careful tuning, especially at scale, and scaling laws offer guidance. The concept of the 'critical batch size' says that increasing batch size yields nearly perfect returns up to a certain point (the noise-limited regime) before diminishing returns set in (the bias-limited regime). This critical batch size, estimated by balancing the number of optimization steps against the number of examples needed to reach a target loss, itself scales with compute: as compute increases and the target loss drops, the critical batch size grows, following a power-law relation (one common formalization is sketched below). Learning rate scaling matters as well; a common rule of thumb is to decrease the learning rate inversely with model width. More advanced techniques like µP (Maximal Update Parametrization) aim to keep the optimal learning rate stable across scales by reparameterizing the model, offering an alternative to hand-tuned scaling heuristics.
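One common formalization of this trade-off comes from the large-batch training literature (McCandlish et al., 2018); whether the lecture uses exactly this form is an assumption here. S is the number of optimization steps and E the number of examples processed to reach a target loss, with S_min and E_min their minimum achievable values:

```latex
\left(\frac{S}{S_{\min}} - 1\right)\left(\frac{E}{E_{\min}} - 1\right) = 1,
\qquad
B_{\mathrm{crit}} \approx \frac{E_{\min}}{S_{\min}}
```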

The Kaplan vs. Chinchilla debate on optimal scaling

A significant point of contention in scaling laws arose between Kaplan et al. (2020) and Hoffmann et al. (Chinchilla, 2022) regarding the optimal balance between model size and data for a given compute budget. Kaplan's work suggested training much larger models on less data (e.g., ~3x more parameters than tokens), driven by their observations and parameter-counting choices (excluding embeddings and the last layer). In contrast, Chinchilla advocated a more balanced approach, suggesting a roughly 20:1 token-to-parameter ratio, using more robust fitting methods (such as isoflops and the lower envelope) and more careful parameter counting. The discrepancies were later attributed to subtle but impactful implementation details, such as how parameters are counted (including or excluding embeddings and the softmax layer) and learning rate warm-up schedules, demonstrating that scaling laws are engineered and highly sensitive to the experimental setup.
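A back-of-the-envelope sketch of the Chinchilla-style split, assuming the common C ≈ 6·N·D approximation for training FLOPs and the 20:1 token-to-parameter rule of thumb; the helper name is ours, for illustration:

```python
def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into parameters and tokens under C ≈ 6*N*D and D ≈ r*N."""
    # C = 6 * N * D with D = r * N  =>  N = sqrt(C / (6 * r)), D = r * N
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a 1e23 FLOP budget (purely illustrative)
n, d = chinchilla_optimal(1e23)
print(f"~{n / 1e9:.1f}B parameters trained on ~{d / 1e9:.0f}B tokens")
```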

Practical implications: Transfer, overtraining, and isoflops

While scaling laws predict upstream performance (like perplexity) very reliably, transferring these gains to downstream tasks is less certain and requires careful consideration. For production models, the goal often shifts from minimizing training compute (as suggested by Chinchilla's ratio) to optimizing inference and serving costs. This frequently leads to 'overtraining' (more tokens per parameter than the compute-optimal ratio), yielding smaller models that are cheaper to serve while remaining highly capable. The 'isoflops' method, where a fixed compute budget is swept across other degrees of freedom (like the parameter-to-data ratio), has endured as a robust research tool; a sketch of one such sweep appears below. It allows for a systematic exploration of the performance surface, providing reliable insights into trade-offs and informing decisions for both research and practical model development.
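A minimal sketch of one isoflops slice, assuming the C ≈ 6·N·D FLOP approximation; the loss surface below is a toy stand-in (a Chinchilla-style functional form with made-up constants), not a fitted result:

```python
import numpy as np

def toy_loss(n_params, n_tokens):
    """Illustrative loss surface with the Chinchilla-style functional form
    L(N, D) = E + A / N**alpha + B / D**beta; constants here are made up."""
    return 2.0 + 400.0 / n_params**0.34 + 400.0 / n_tokens**0.28

compute = 1e21                        # fixed FLOP budget for one isoflop slice
n_grid = np.logspace(8, 10.5, 40)     # candidate model sizes to sweep
d_grid = compute / (6.0 * n_grid)     # tokens implied by C ≈ 6*N*D

losses = toy_loss(n_grid, d_grid)
best = int(np.argmin(losses))
print(f"isoflop minimum near N ≈ {n_grid[best]:.2e} params, D ≈ {d_grid[best]:.2e} tokens")
```

Repeating this sweep at several compute budgets traces out how the optimal (N, D) pair scales, which is exactly how the compute-optimal ratios discussed above are estimated.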

Common Questions

What are scaling laws?

Scaling laws are predictive rules that describe how a model's performance (like test loss) changes as resources (such as compute, model size, or dataset size) are increased. They allow researchers to extrapolate behaviors from small-scale experiments to large-scale systems, optimizing training cost and efficiency.
