Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 3: Architectures

Stanford Online
Education · 6 min read · 90 min video
Apr 15, 2026
TL;DR

Modern LLMs overwhelmingly adopt pre-norm layer normalization and RMS norm, ditching biases for efficiency, even if it means sacrificing some expressiveness. These seemingly small changes significantly boost training stability and speed, highlighting the critical interplay between architecture and system performance.

Key Insights

1

All modern language models have moved layer normalization out of the residual stream, with most using pre-norm for improved gradient propagation and stability, a change that also removes the need for learning-rate warm-up periods.

2

RMS Norm is universally adopted over Layer Norm in modern models because it delivers significant speed improvements without a substantial loss of expressiveness: omitting mean subtraction and bias terms removes memory-bound operations that can account for up to 25% of runtime despite contributing negligible FLOPs.

3

Gated Linear Units (GLUs), such as SwiGLU and GeGLU, are now standard in nearly all credible modern language models, consistently showing performance gains over non-GLU variants, even in parameter-matched comparisons.

4

RoPE (Rotary Position Embeddings) has become dominant for position dependence since 2024, offering a relative positional embedding solution by rotating word vectors based on their position, ensuring invariance to absolute positions.

5

The ratio between feed-forward network size and model dimension is a critical hyperparameter, with a common rule of thumb being 4x for non-GLU models and around 2.67x for GLU variants.

6

Grouped Query Attention (GQA) offers a favorable trade-off between inference cost and performance, significantly reducing memory access for the KV cache while maintaining near multi-head attention performance levels.

Standardization of Layer Norm and Normalization Techniques

A near-universal consensus has emerged regarding layer normalization in modern language models. The original Transformer's post-norm placement inside the residual stream has been largely abandoned in favor of pre-norm configurations, where layer normalization precedes each sublayer's computation. This shift is primarily driven by better gradient propagation and training stability, allowing deeper models to converge reliably and removing the need for learning-rate warm-up periods, a significant improvement over the original Transformer design. Furthermore, RMS Norm has almost entirely replaced Layer Norm, not because it is representationally superior, but for substantial systems-efficiency gains. Whereas Layer Norm involves mean subtraction and bias terms, RMS Norm simplifies this to scaling alone, eliminating operations that, while negligible in floating-point operations (FLOPs), can account for up to 25% of runtime due to their memory-bound nature and low arithmetic intensity. This optimization is crucial for keeping GPUs 'hot' with dense computation rather than wasting cycles on data movement.
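The contrast between the two normalizations can be made concrete in a minimal NumPy sketch (shapes and epsilon are illustrative, not from the lecture): Layer Norm subtracts the mean and adds a bias, while RMS Norm only rescales by the root-mean-square of the activations.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-6):
    # Layer Norm: subtract the mean, divide by the std, then scale and shift.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mu) / np.sqrt(var + eps) + bias

def rms_norm(x, gain, eps=1e-6):
    # RMS Norm: no mean subtraction, no bias -- only a rescale by the
    # root-mean-square, which cuts out the memory-bound extra passes.
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return gain * x / rms

x = np.random.randn(4, 8)
g = np.ones(8)
ln_out = layer_norm(x, g, np.zeros(8))
out = rms_norm(x, g)
```

With an all-ones gain, each row of the RMS-normalized output has root-mean-square close to 1, which is all the scale control the model appears to need in practice.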

The Rise of Gated Linear Units (GLUs) in MLPs

The feed-forward network (FFN) component within transformer blocks has seen a significant evolution with the widespread adoption of Gated Linear Units (GLUs). While simpler activations like ReLU or GELU can train functional models, GLUs, such as SwiGLU and GeGLU, are now standard in almost all credible modern language models. These gated activations introduce a multiplicative gating mechanism that modulates the output of a linear transformation, empirically leading to consistent performance improvements. A common practice when implementing GLUs is to scale the feed-forward dimension down to two-thirds of its usual width to maintain a similar overall parameter count as non-GLU architectures. This parameter-matched comparison, often seen in research papers, consistently demonstrates the superiority of GLUs across various benchmarks, making them a de facto standard for enhancing model expressiveness and performance.
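A SwiGLU FFN can be sketched in a few lines of NumPy (the dimensions and weight initialization here are illustrative assumptions, not values from the lecture): a SiLU-activated gate multiplies the up-projection element-wise before the down-projection.

```python
import numpy as np

def silu(x):
    # SiLU / Swish activation: x * sigmoid(x).
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    # SwiGLU: an element-wise gate silu(x @ W_gate) modulates x @ W_up,
    # then W_down projects back to the model dimension.
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

d_model = 16
d_ff = int(d_model * 8 / 3)  # 2/3 of the usual 4x, to match parameter count
rng = np.random.default_rng(0)
W_gate = rng.standard_normal((d_model, d_ff)) * 0.02
W_up   = rng.standard_normal((d_model, d_ff)) * 0.02
W_down = rng.standard_normal((d_ff, d_model)) * 0.02
y = swiglu_ffn(rng.standard_normal((2, d_model)), W_gate, W_up, W_down)
```

Note the third weight matrix (`W_gate`) relative to a plain FFN, which is exactly why the hidden dimension is shrunk by 2/3 in parameter-matched comparisons.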

Rotary Position Embeddings (RoPE) for Relative Positional Information

Accurately encoding positional information is critical for attention mechanisms, which are inherently position-agnostic. While early models used sinusoidal or absolute position embeddings, RoPE (Rotary Position Embeddings) has become the dominant approach since approximately 2024. RoPE is a relative position embedding method designed to ensure that the inner product of embeddings depends only on the relative difference between positions, not their absolute locations. It achieves this by rotating word vectors in a high-dimensional space based on their position. By decomposing the high-dimensional rotation into repeated 2D rotations, RoPE effectively encodes relative positional information. This design choice avoids the cross-terms found in absolute embeddings and provides a more principled way to handle relative positions compared to simply modifying attention scores. The widespread adoption of RoPE underscores its effectiveness in capturing sequential dependencies crucial for language understanding.
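The 2D-rotation decomposition described above can be sketched in NumPy (the base frequency of 10,000 follows the original RoPE convention; the vector sizes are illustrative). The key property to verify is that inner products of rotated vectors depend only on the relative offset between positions.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate consecutive (even, odd) pairs of x by an angle proportional
    # to the position; each pair gets its own frequency.
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per 2D pair
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.randn(8)
k = np.random.randn(8)
# Same relative offset (2), different absolute positions:
a = rope(q, 3) @ rope(k, 5)
b = rope(q, 10) @ rope(k, 12)
```

Because each 2D pair satisfies R(θ₁)q · R(θ₂)k = q · R(θ₂ − θ₁)k, the two dot products `a` and `b` agree, which is precisely the relative-position invariance the section describes.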

Hyperparameter Trade-offs in Feed-Forward Networks

The ratio between the feed-forward network (FFN) hidden dimension and the model dimension is a key hyperparameter. For standard non-GLU models, a multiplier of 4x is a common and effective rule of thumb. When GLUs are used, this ratio typically decreases to around 2.67x to maintain parameter parity, compensating for the extra gate matrix in GLU layers. While some models, like T5, have explored extreme ratios (e.g., 64x) for potential systems-efficiency gains, subsequent versions often revert to more standard ratios, suggesting that radical deviations may not offer commensurate benefits or could be computationally inefficient. Research, such as the Kaplan et al. (2020) scaling-laws paper, indicates a broad 'sweet spot' for this ratio, where performance is relatively stable between approximately 1x and 10x. This suggests that while deviations are possible, adhering to established ratios generally ensures good performance without significant risk.
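The arithmetic behind the 2.67x rule of thumb is simple to check (the model dimension below is just an example): a plain FFN has two weight matrices, a GLU FFN has three, and 8/3 ≈ 2.67 makes the totals line up.

```python
# Parameter parity between a 4x non-GLU FFN and a ~2.67x GLU FFN.
d_model = 4096

# Non-GLU FFN: up-projection + down-projection (2 matrices).
d_ff_plain = 4 * d_model
plain_params = 2 * d_model * d_ff_plain        # = 8 * d_model^2

# GLU FFN: gate + up + down (3 matrices); shrink d_ff to 2/3 of 4x.
d_ff_glu = int(8 / 3 * d_model)                # ~2.67x
glu_params = 3 * d_model * d_ff_glu            # ~= 8 * d_model^2

print(plain_params, glu_params)
```

Both totals come out to roughly 8·d_model², differing only by integer rounding of the GLU hidden width.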

Stability Interventions: Z-Loss and QK Norm

Ensuring training stability, especially for increasingly large and expensive models, is paramount. Two significant interventions target this: the Z-loss trick and QK Norm. The Z-loss trick, pioneered by Jacob Devlin, stabilizes the output softmax by penalizing the deviation of the log-normalizer (log Z) from zero, preventing blow-ups caused by extremely large or small Z values. This technique is employed in models like Baichuan, DCLM, and OLMo. QK Norm, originating from multimodal research, integrates layer normalization directly into the attention mechanism by normalizing the queries (Q) and keys (K) before their dot product. This ensures that the inputs to the attention softmax have a consistent scale, thereby stabilizing the attention computations and preventing degeneracies, a common issue in large models. QK Norm appears to have minimal impact on performance while significantly enhancing stability, and it has become standard practice in many large language models.
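The Z-loss term can be sketched as follows (the weight `alpha` and the vocabulary size are illustrative; the auxiliary term is simply added to the cross-entropy loss during training):

```python
import numpy as np

def z_loss(logits, alpha=1e-4):
    # log Z is the log-normalizer of the output softmax, computed with the
    # max-subtraction trick for numerical stability. Penalizing (log Z)^2
    # pulls log Z toward zero so the softmax stays well-scaled.
    m = logits.max(axis=-1, keepdims=True)
    log_z = (m + np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))).squeeze(-1)
    return alpha * (log_z ** 2).mean()

logits = np.random.randn(4, 32000)   # (batch, vocab)
aux = z_loss(logits)                 # auxiliary term added to the main loss
```

QK Norm, by contrast, needs no extra loss term: it simply applies an RMS-style normalization to Q and K before the attention dot product.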

Efficient Attention Mechanisms: GQA and Sliding Window

Inference cost is a major concern, particularly for serving large language models. Multi-Head Attention (MHA), while performant during training, leads to a large KV cache, increasing memory access and slowing down inference. Multi-Query Attention (MQA) addresses this by sharing keys and values across all heads, drastically reducing KV cache size and improving arithmetic intensity, but often at the cost of model expressiveness. Grouped-Query Attention (GQA) offers a compromise: it shares each key/value head across a group of query heads, allowing a controlled reduction in KV cache size while retaining more query heads than MQA. This results in a favorable trade-off, achieving inference costs close to MQA with performance nearly matching MHA. Additionally, Sliding Window Attention, an older idea revived recently, alternates between full attention and localized windowed attention. This hybrid approach effectively manages long-context dependencies without the quadratic cost of global attention over very long sequences, as seen in models like Cohere Command A and Llama 4, striking a balance between context length, performance, and computational efficiency.
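The KV-head sharing in GQA can be sketched in NumPy (head counts and dimensions are illustrative, and causal masking is omitted for brevity): each group of query heads attends against a single shared key/value head, so the KV cache shrinks by the group factor.

```python
import numpy as np

def gqa(q, k, v):
    # q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).
    # Each consecutive group of query heads shares one KV head.
    n_q_heads, seq, d = q.shape
    n_kv_heads = k.shape[0]
    group = n_q_heads // n_kv_heads
    k = np.repeat(k, group, axis=0)   # broadcast shared KV to each q head
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)   # softmax over keys
    return w @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 5, 16))   # 8 query heads
k = rng.standard_normal((2, 5, 16))   # only 2 KV heads -> 4x smaller KV cache
v = rng.standard_normal((2, 5, 16))
out = gqa(q, k, v)
```

With 2 KV heads for 8 query heads, the cache is a quarter the MHA size; setting the KV head count to 1 recovers MQA, and setting it equal to the query head count recovers MHA.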

Hyperparameter Flexibility and Converging Defaults

While the architecture and training landscape is diverse, many hyperparameters have converged on effective defaults. The ratio of feed-forward dimension to model dimension (around 4x for non-GLU, 2.67x for GLU), a head dimension equal to the model dimension divided by the number of heads, and an aspect ratio (model dimension / layers) of approximately 100 are common. Surprisingly, regularization techniques like weight decay are still popular, not primarily for overfitting prevention (which is rare in single-pass SGD training) but as an optimization intervention that can interact beneficially with optimizers and learning-rate decay. Vocabulary size shows a clear dichotomy: smaller sizes (around 30,000) for monolingual models and larger ones (100,000-200,000) for multilingual or production systems. This convergence on effective defaults suggests that while radical hyperparameter choices might be possible, established ranges offer robust performance and computational efficiency.
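These rules of thumb make a back-of-the-envelope sizing calculation straightforward (the model dimension and head count below are illustrative choices, not values from the lecture):

```python
# Back-of-the-envelope config from the converged defaults.
d_model = 4096
n_heads = 32

head_dim = d_model // n_heads       # head dim = model dim / heads -> 128
n_layers = round(d_model / 100)     # aspect ratio d_model/layers ~ 100
vocab_mono = 30_000                 # typical monolingual vocabulary
vocab_multi = 128_000               # typical multilingual/production range

print(head_dim, n_layers)
```

None of these values is sacred, but the section's point is that staying inside these ranges is low-risk, while radical deviations rarely pay off.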

Common Questions

What architectural change do modern language models most widely agree on?

The most widely agreed-upon change is the placement of layer normalization outside the residual stream, typically before computations (pre-norm), which improves stability and gradient propagation compared to the original post-norm approach.

