Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 3: Architectures

Stanford Online
Education · 6 min read · 90 min video
Apr 15, 2026
TL;DR

Modern LLMs overwhelmingly adopt pre-norm layer normalization and RMS norm, ditching biases for efficiency, even if it means sacrificing some expressiveness. These seemingly small changes significantly boost training stability and speed, highlighting the critical interplay between architecture and system performance.

Key Insights

1

All modern language models have moved layer normalization out of the residual stream, with most using pre-norm for improved gradient propagation and stability, a change that also removes the need for learning-rate warm-up periods.

2

RMS Norm is universally adopted over Layer Norm in modern models because it delivers significant speed improvements without a substantial loss of expressiveness: omitting mean subtraction and bias terms removes memory-bound operations that can account for up to 25% of runtime despite contributing negligible FLOPs.

3

Gated Linear Units (GLUs), such as SwiGLU and GeGLU, are now standard in nearly all credible modern language models, consistently showing performance gains over non-GLU variants, even in parameter-matched comparisons.

4

RoPE (Rotary Position Embeddings) has become dominant for position dependence since 2024, offering a relative positional embedding solution by rotating word vectors based on their position, ensuring invariance to absolute positions.

5

The ratio between feed-forward network size and model dimension is a critical hyperparameter, with a common rule of thumb being 4x for non-GLU models and around 2.67x for GLU variants.

6

Grouped Query Attention (GQA) offers a favorable trade-off between inference cost and performance, significantly reducing memory access for the KV cache while maintaining near multi-head attention performance levels.

Standardization of Layer Norm and Normalization Techniques

A near-universal consensus has emerged regarding layer normalization in modern language models. The original Transformer's post-norm placement inside the residual stream has been largely abandoned in favor of pre-norm configurations, where layer normalization precedes each sublayer's computation. This shift is primarily driven by better gradient propagation and training stability, allowing deeper models to converge reliably and removing the need for learning-rate warm-up periods, a significant improvement over the original Transformer design. Furthermore, RMS Norm has almost entirely replaced Layer Norm, not because it is representationally superior, but for substantial systems-efficiency gains. Whereas Layer Norm involves mean subtraction and bias terms, RMS Norm simplifies this to scaling alone, eliminating operations that, while negligible in floating-point operations (FLOPs), can account for up to 25% of runtime due to their memory-bound nature and low arithmetic intensity. This optimization is crucial for keeping GPUs 'hot' with dense computation rather than wasting cycles on data movement.
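The contrast between the two normalizations can be made concrete in a minimal NumPy sketch (shapes and epsilon are illustrative, not from the lecture): Layer Norm subtracts the mean and adds a bias, while RMS Norm only rescales by the root-mean-square of the activations.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-6):
    # Layer Norm: subtract the mean, divide by the std, then scale and shift.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mu) / np.sqrt(var + eps) + bias

def rms_norm(x, gain, eps=1e-6):
    # RMS Norm: no mean subtraction, no bias -- only a rescale by the
    # root-mean-square, which cuts out the memory-bound extra passes.
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return gain * x / rms

x = np.random.randn(4, 8)
g = np.ones(8)
ln_out = layer_norm(x, g, np.zeros(8))
out = rms_norm(x, g)
```

With an all-ones gain, each row of the RMS-normalized output has root-mean-square close to 1, which is all the scale control the model appears to need in practice.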

The Rise of Gated Linear Units (GLUs) in MLPs

The feed-forward network (FFN) component within transformer blocks has seen a significant evolution with the widespread adoption of Gated Linear Units (GLUs). While simpler activations like ReLU or GELU can train functional models, GLUs, such as SwiGLU and GeGLU, are now standard in almost all credible modern language models. These gated activations introduce a multiplicative gating mechanism that modulates the output of a linear transformation, empirically leading to consistent performance improvements. A common practice when implementing GLUs is to scale the feed-forward dimension down to two-thirds of its usual width to maintain a similar overall parameter count as non-GLU architectures. This parameter-matched comparison, often seen in research papers, consistently demonstrates the superiority of GLUs across various benchmarks, making them a de facto standard for enhancing model expressiveness and performance.
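A SwiGLU FFN can be sketched in a few lines of NumPy (the dimensions and weight initialization here are illustrative assumptions, not values from the lecture): a SiLU-activated gate multiplies the up-projection element-wise before the down-projection.

```python
import numpy as np

def silu(x):
    # SiLU / Swish activation: x * sigmoid(x).
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    # SwiGLU: an element-wise gate silu(x @ W_gate) modulates x @ W_up,
    # then W_down projects back to the model dimension.
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

d_model = 16
d_ff = int(d_model * 8 / 3)  # 2/3 of the usual 4x, to match parameter count
rng = np.random.default_rng(0)
W_gate = rng.standard_normal((d_model, d_ff)) * 0.02
W_up   = rng.standard_normal((d_model, d_ff)) * 0.02
W_down = rng.standard_normal((d_ff, d_model)) * 0.02
y = swiglu_ffn(rng.standard_normal((2, d_model)), W_gate, W_up, W_down)
```

Note the third weight matrix (`W_gate`) relative to a plain FFN, which is exactly why the hidden dimension is shrunk by 2/3 in parameter-matched comparisons.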

Rotary Position Embeddings (RoPE) for Relative Positional Information

Accurately encoding positional information is critical for attention mechanisms, which are inherently position-agnostic. While early models used sinusoidal or absolute position embeddings, RoPE (Rotary Position Embeddings) has become the dominant approach since approximately 2024. RoPE is a relative position embedding method designed to ensure that the inner product of embeddings depends only on the relative difference between positions, not their absolute locations. It achieves this by rotating word vectors in a high-dimensional space based on their position. By decomposing the high-dimensional rotation into repeated 2D rotations, RoPE effectively encodes relative positional information. This design choice avoids the cross-terms found in absolute embeddings and provides a more principled way to handle relative positions compared to simply modifying attention scores. The widespread adoption of RoPE underscores its effectiveness in capturing sequential dependencies crucial for language understanding.
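The 2D-rotation decomposition described above can be sketched in NumPy (the base frequency of 10,000 follows the original RoPE convention; the vector sizes are illustrative). The key property to verify is that inner products of rotated vectors depend only on the relative offset between positions.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate consecutive (even, odd) pairs of x by an angle proportional
    # to the position; each pair gets its own frequency.
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per 2D pair
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.randn(8)
k = np.random.randn(8)
# Same relative offset (2), different absolute positions:
a = rope(q, 3) @ rope(k, 5)
b = rope(q, 10) @ rope(k, 12)
```

Because each 2D pair satisfies R(θ₁)q · R(θ₂)k = q · R(θ₂ − θ₁)k, the two dot products `a` and `b` agree, which is precisely the relative-position invariance the section describes.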

Hyperparameter Trade-offs in Feed-Forward Networks

The ratio between the feed-forward network (FFN) hidden dimension and the model dimension is a key hyperparameter. For standard non-GLU models, a multiplier of 4x is a common and effective rule of thumb. When GLUs are used, this ratio typically decreases to around 2.67x to maintain parameter parity, compensating for the extra gate matrix in GLU layers. While some models, like T5, have explored extreme ratios (e.g., 64x) for potential systems-efficiency gains, subsequent versions often revert to more standard ratios, suggesting that radical deviations may not offer commensurate benefits or could be computationally inefficient. Research, such as the Kaplan et al. (2020) scaling-laws paper, indicates a broad 'sweet spot' for this ratio, where performance is relatively stable between approximately 1x and 10x. This suggests that while deviations are possible, adhering to established ratios generally ensures good performance without significant risk.
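The arithmetic behind the 2.67x rule of thumb is simple to check (the model dimension below is just an example): a plain FFN has two weight matrices, a GLU FFN has three, and 8/3 ≈ 2.67 makes the totals line up.

```python
# Parameter parity between a 4x non-GLU FFN and a ~2.67x GLU FFN.
d_model = 4096

# Non-GLU FFN: up-projection + down-projection (2 matrices).
d_ff_plain = 4 * d_model
plain_params = 2 * d_model * d_ff_plain        # = 8 * d_model^2

# GLU FFN: gate + up + down (3 matrices); shrink d_ff to 2/3 of 4x.
d_ff_glu = int(8 / 3 * d_model)                # ~2.67x
glu_params = 3 * d_model * d_ff_glu            # ~= 8 * d_model^2

print(plain_params, glu_params)
```

Both totals come out to roughly 8·d_model², differing only by integer rounding of the GLU hidden width.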

Stability Interventions: Z-Loss and QK Norm

Ensuring training stability, especially for increasingly large and expensive models, is paramount. Two significant interventions target this: the Z-loss trick and QK Norm. The Z-loss trick, pioneered by Jacob Devlin, stabilizes the output softmax by penalizing the deviation of the log-normalizer (log Z) from zero, preventing blow-ups caused by extremely large or small Z values. This technique is employed in models like Baichuan, DCLM, and OLMo. QK Norm, originating from multimodal research, integrates layer normalization directly into the attention mechanism by normalizing the queries (Q) and keys (K) before their dot product. This ensures that the inputs to the attention softmax have a consistent scale, thereby stabilizing the attention computations and preventing degeneracies, a common issue in large models. QK Norm appears to have minimal impact on performance while significantly enhancing stability, and it has become standard practice in many large language models.
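The Z-loss term can be sketched as follows (the weight `alpha` and the vocabulary size are illustrative; the auxiliary term is simply added to the cross-entropy loss during training):

```python
import numpy as np

def z_loss(logits, alpha=1e-4):
    # log Z is the log-normalizer of the output softmax, computed with the
    # max-subtraction trick for numerical stability. Penalizing (log Z)^2
    # pulls log Z toward zero so the softmax stays well-scaled.
    m = logits.max(axis=-1, keepdims=True)
    log_z = (m + np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))).squeeze(-1)
    return alpha * (log_z ** 2).mean()

logits = np.random.randn(4, 32000)   # (batch, vocab)
aux = z_loss(logits)                 # auxiliary term added to the main loss
```

QK Norm, by contrast, needs no extra loss term: it simply applies an RMS-style normalization to Q and K before the attention dot product.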

Efficient Attention Mechanisms: GQA and Sliding Window

Inference cost is a major concern, particularly for serving large language models. Multi-Head Attention (MHA), while performant during training, leads to a large KV cache, increasing memory access and slowing down inference. Multi-Query Attention (MQA) addresses this by sharing keys and values across all heads, drastically reducing KV cache size and improving arithmetic intensity, but often at the cost of model expressiveness. Grouped-Query Attention (GQA) offers a compromise: it shares each key/value head across a group of query heads, allowing a controlled reduction in KV cache size while retaining more query heads than MQA. This results in a favorable trade-off, achieving inference costs close to MQA with performance nearly matching MHA. Additionally, Sliding Window Attention, an older idea revived recently, alternates between full attention and localized windowed attention. This hybrid approach effectively manages long-context dependencies without the quadratic cost of global attention over very long sequences, as seen in models like Cohere Command A and Llama 4, striking a balance between context length, performance, and computational efficiency.
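The KV-head sharing in GQA can be sketched in NumPy (head counts and dimensions are illustrative, and causal masking is omitted for brevity): each group of query heads attends against a single shared key/value head, so the KV cache shrinks by the group factor.

```python
import numpy as np

def gqa(q, k, v):
    # q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).
    # Each consecutive group of query heads shares one KV head.
    n_q_heads, seq, d = q.shape
    n_kv_heads = k.shape[0]
    group = n_q_heads // n_kv_heads
    k = np.repeat(k, group, axis=0)   # broadcast shared KV to each q head
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)   # softmax over keys
    return w @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 5, 16))   # 8 query heads
k = rng.standard_normal((2, 5, 16))   # only 2 KV heads -> 4x smaller KV cache
v = rng.standard_normal((2, 5, 16))
out = gqa(q, k, v)
```

With 2 KV heads for 8 query heads, the cache is a quarter the MHA size; setting the KV head count to 1 recovers MQA, and setting it equal to the query head count recovers MHA.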

Hyperparameter Flexibility and Converging Defaults

While the architecture and training landscape is diverse, many hyperparameters have converged on effective defaults. The ratio of feed-forward dimension to model dimension (around 4x for non-GLU, 2.67x for GLU), a head dimension equal to the model dimension divided by the number of heads, and an aspect ratio (model dimension / layers) of approximately 100 are common. Surprisingly, regularization techniques like weight decay are still popular, not primarily for overfitting prevention (which is rare in single-pass SGD training) but as an optimization intervention that can interact beneficially with optimizers and learning-rate decay. Vocabulary size shows a clear dichotomy: smaller sizes (around 30,000) for monolingual models and larger ones (100,000-200,000) for multilingual or production systems. This convergence on effective defaults suggests that while radical hyperparameter choices might be possible, established ranges offer robust performance and computational efficiency.
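These rules of thumb make a back-of-the-envelope sizing calculation straightforward (the model dimension and head count below are illustrative choices, not values from the lecture):

```python
# Back-of-the-envelope config from the converged defaults.
d_model = 4096
n_heads = 32

head_dim = d_model // n_heads       # head dim = model dim / heads -> 128
n_layers = round(d_model / 100)     # aspect ratio d_model/layers ~ 100
vocab_mono = 30_000                 # typical monolingual vocabulary
vocab_multi = 128_000               # typical multilingual/production range

print(head_dim, n_layers)
```

None of these values is sacred, but the section's point is that staying inside these ranges is low-risk, while radical deviations rarely pay off.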

Common Questions

What architectural change do modern language models most widely agree on?

The most widely agreed-upon change is the placement of layer normalization outside the residual stream, typically before computations (pre-norm), which improves stability and gradient propagation compared to the original post-norm approach.

