Key Moments

Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 4: Attention Alternatives

Stanford Online
Education · 7 min read · 87 min video
Apr 15, 2026
TL;DR

Linear attention and Mixture of Experts (MoE) offer radical gains in handling long contexts and model parameters, respectively, by rethinking core architectural components like attention and MLPs, enabling more efficient and powerful language models.

Key Insights

1

The computational cost of attention scales quadratically with sequence length (N^2), quickly becoming the bottleneck for long contexts, while the feed-forward network cost grows linearly.

2

Flash Attention offers a significant constant factor improvement in attention performance by optimizing memory transfers, though it doesn't solve the fundamental quadratic scaling issue.

3

Linear attention reorders the attention computation to achieve O(N) dependence on sequence length by leveraging the associativity of matrix multiplication, enabling RNN-like inference efficiency.

4

State space models like Mamba 2 and Gated DeltaNet build upon linear attention by introducing input-dependent gating mechanisms, allowing for controlled information flow while retaining parallel training and recurrent inference capabilities.

5

Mixture of Experts (MoE) replaces dense MLPs with sparsely activated 'experts', allowing models to have vastly more parameters (e.g., billions) while only incurring the computational cost of a fraction of them per token.

6

Training MoEs effectively relies on auxiliary losses (e.g., expert balancing) to prevent expert collapse and ensure balanced utilization across available experts and hardware, as highlighted by DeepSeek's architectural designs.

The quadratic cost of attention and the drive for longer contexts

The lecture begins by highlighting the increasing demand for longer context windows in language models, driven by the need for models to handle more information. This trend is visualized by a rush among leading AI vendors to offer larger context sizes. A critical challenge is the computational cost associated with attention mechanisms. While the feed-forward network (FFN) component's cost scales linearly with sequence length, the self-attention mechanism's all-to-all connections result in a quadratic (N^2) cost. This quadratic scaling quickly outpaces linear growth as context lengths increase, making attention the primary cost driver for long sequences. This necessitates exploring architectural changes to mitigate these costs, especially when aiming for contexts of millions of tokens.
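To make the crossover concrete, here is a back-of-the-envelope FLOPs count. This is a sketch under our own assumptions (not figures from the lecture): hidden size d, FFN width 4d, and only the dominant matrix multiplies counted.

```python
def transformer_layer_flops(n, d, d_ff=None):
    """Rough forward-pass FLOPs for one transformer layer at sequence length n.

    Assumptions (ours, not the lecture's): hidden size d, FFN width
    d_ff = 4 * d, and a matmul of (a, b) @ (b, c) costs 2*a*b*c FLOPs.
    """
    d_ff = d_ff or 4 * d
    proj = 4 * 2 * n * d * d      # Q, K, V, and output projections: linear in n
    attn = 2 * 2 * n * n * d      # QK^T scores and weighted sum with V: quadratic in n
    ffn = 2 * 2 * n * d * d_ff    # FFN up- and down-projection: linear in n
    return proj, attn, ffn

# The FFN dominates at short contexts; attention overtakes it once n exceeds d_ff.
_, attn_short, ffn_short = transformer_layer_flops(n=1024, d=1024)
_, attn_long, ffn_long = transformer_layer_flops(n=131072, d=1024)
print(attn_short < ffn_short, attn_long > ffn_long)  # True True
```

Under these assumptions the break-even point is at n = d_ff: below it the linear FFN term dominates, above it the quadratic attention term takes over, which is why million-token contexts force architectural changes.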

Flash Attention: A systems-level win reducing memory overhead

Before delving into radical architectural changes, the lecture acknowledges the impact of systems engineering on efficiency. Flash Attention is presented as a prime example of how optimizing memory-transfer overheads, rather than changing the core algorithmic complexity, can yield dramatic performance improvements. By rearranging the attention operation so that the large intermediate attention matrix is never materialized, Flash Attention achieves roughly 2x speedups and makes it possible to run attention at sequence lengths whose full score matrix would not otherwise fit in memory. While it doesn't address the quadratic complexity itself, it demonstrates the power of 'constant factor' improvements derived from systems-level optimization.
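The core trick can be illustrated in NumPy with an online softmax over key/value tiles. This is only a sketch of the idea, not the real fused GPU kernel: block size and variable names are our own, causal masking is omitted, and running it on CPU gives no actual memory savings.

```python
import numpy as np

def flash_attention_sketch(Q, K, V, block=64):
    """Tiled attention with an online softmax: the full (n, n) score
    matrix is never materialized, only (block, block) tiles."""
    n, d = Q.shape
    out = np.zeros((n, V.shape[1]))
    for i in range(0, n, block):
        q = Q[i:i+block]                        # query tile
        m = np.full(q.shape[0], -np.inf)        # running row-wise max
        l = np.zeros(q.shape[0])                # running softmax denominator
        acc = np.zeros((q.shape[0], V.shape[1]))
        for j in range(0, n, block):
            s = q @ K[j:j+block].T / np.sqrt(d)     # one score tile
            m_new = np.maximum(m, s.max(axis=1))
            scale = np.exp(m - m_new)               # rescale old accumulator
            p = np.exp(s - m_new[:, None])
            l = l * scale + p.sum(axis=1)
            acc = acc * scale[:, None] + p @ V[j:j+block]
            m = m_new
        out[i:i+block] = acc / l[:, None]
    return out

def naive_attention(Q, K, V):
    """Reference implementation that materializes the full score matrix."""
    s = Q @ K.T / np.sqrt(Q.shape[1])
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ V
```

The two functions compute the same result; the point is that the tiled version only ever holds one small score tile at a time, which is what lets the real kernel keep everything in fast on-chip memory.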

Linear attention: Exploiting associativity for O(N) complexity

To fundamentally address the quadratic cost of attention, the lecture introduces 'linear attention.' The core idea is to reorder the standard attention computation QK^T V by leveraging the associativity of matrix multiplication. By removing the softmax normalization (or, for conceptual simplicity, treating it as the identity), the computation can be rearranged to multiply Keys (K) and Values (V) first and then multiply by Queries (Q), or reformulated in an RNN-like sequential manner. This changes the dependence from N^2 to O(N * D_k * D_v), where D_k and D_v are the key and value dimensions. Because these dimensions are typically much smaller than the sequence length N, the overall complexity is far more favorable. The linear form also admits an RNN-like inference mode in which a fixed-size state is maintained, offering inference efficiency similar to RNNs while retaining parallel computation for training.
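Both forms can be sketched in a few lines of NumPy (softmax dropped, as in the lecture's simplification; names are ours). The parallel form below is the non-causal version, while the recurrent form is causal, so they agree exactly on the final token:

```python
import numpy as np

def linear_attention_parallel(Q, K, V):
    """(Q K^T) V reordered as Q (K^T V): cost O(n * d_k * d_v), not O(n^2 * d).

    Non-causal: every query sees every key. A causal version needs the
    recurrent form below (or a masked prefix sum)."""
    return Q @ (K.T @ V)              # (n, d_k) @ (d_k, d_v)

def linear_attention_recurrent(Q, K, V):
    """Same computation as a causal RNN with a fixed-size (d_k, d_v) state."""
    n, d_k = Q.shape
    S = np.zeros((d_k, V.shape[1]))   # running sum of outer products k_t v_t^T
    out = np.zeros((n, V.shape[1]))
    for t in range(n):
        S = S + np.outer(K[t], V[t])  # state update: S_t = S_{t-1} + k_t v_t^T
        out[t] = Q[t] @ S             # readout: o_t = q_t^T S_t
    return out
```

The recurrent form is what makes inference RNN-like: generating each new token touches only the fixed-size state S, not the whole history.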

State space models: Gating and recurrence for expressive linear attention

Building on linear attention, state space models like Mamba 2 and Gated DeltaNet introduce more expressive updates. The key innovation is adding input-dependent gating mechanisms. In Mamba 2, a gate 'gamma(t)' modulates how much of the previous state is carried forward, inspired by the LSTM's ability to forget or retain information. Crucially, if these gates depend only on the current input (not the state), the models retain the duality of parallel training and efficient recurrent inference. Gated DeltaNet further enhances this with a 'beta(t)' gate, enabling more complex state updates, including projection-based updates that erase previous information along a specific direction. Such models are often deployed in hybrid architectures alongside standard attention layers: Nemotron 3 combines Mamba 2 layers with softmax attention, and Qwen 3.5 uses a 3:1 ratio of Gated DeltaNet to softmax-attention layers, demonstrating practical gains in throughput and efficiency at longer context lengths.
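A single recurrent step of such a gated update can be sketched as follows. The notation is ours, not the papers': gamma plays the role of a Mamba-2-style forget gate, and beta scales a delta-rule write that first erases whatever the state stores along direction k before writing the new value.

```python
import numpy as np

def gated_delta_step(S, k, v, gamma, beta):
    """One recurrent state update in the spirit of Gated DeltaNet (sketch).

    S is the (d_k, d_v) state; gamma in [0, 1] decays the old state,
    beta scales the erase-then-write. Both gates depend only on the
    current input, which is what preserves the parallel-training /
    recurrent-inference duality described above."""
    S = gamma * S                            # forget gate: decay old state
    S = S - beta * np.outer(k, k @ S)        # erase old value stored along k
    S = S + beta * np.outer(k, v)            # write the new association k -> v
    return S
```

One consequence worth noting: with gamma = beta = 1 and a unit-norm key k, reading the updated state with k (i.e., k @ S) returns exactly v, so the update overwrites the old association rather than blindly accumulating on top of it, as plain linear attention would.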

Sparse attention alternatives: DeepSeek Sparse Attention (DSA)

An alternative approach to managing attention costs, particularly for very long contexts, is sparse attention. DeepSeek Sparse Attention (DSA) employs a lightweight indexer to identify a small subset of relevant tokens from the long context; standard full attention is then computed only on this subset. The indexer itself uses QK inner products with weightings derived from preceding tokens, followed by a top-K selection. Notably, DSA can be applied during a model's long-context extension phase, after standard pre-training, without requiring full end-to-end training with the sparse mechanism. DeepSeek V3.2, which uses DSA, showed performance competitive with frontier models like Claude 4.5 Sonnet, with improved prefill and decoding scaling, and GLM 5 also adopted DSA, remaining competitive even on long-context retrieval tasks. The indexer still involves a quadratic computation, but it operates on low-dimensional projections or at lower precision to reduce cost, and the final attention runs on a much smaller top-K subset, making the scheme significantly cheaper than full attention over the entire sequence.
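The two-stage structure can be sketched as below. This is a deliberate simplification: the real DSA indexer uses learned per-token weightings and lower precision, whereas here Qi and Ki are just hypothetical low-dimensional indexer projections scored with plain inner products.

```python
import numpy as np

def sparse_attention_sketch(Q, K, V, Qi, Ki, top_k=16):
    """Indexer-then-attend sparse attention (causal, per-token loop).

    Q, K, V: full-dimension projections; Qi, Ki: cheap low-dimensional
    indexer projections (our assumption). For each query, the indexer
    scores all prior tokens, keeps the top_k, and full attention runs
    only on that subset."""
    n, d = Q.shape
    out = np.zeros_like(V)
    for t in range(n):
        scores = Qi[t] @ Ki[:t+1].T                # cheap low-dim indexer scores
        keep = np.argsort(scores)[-top_k:]         # indices of the top-K tokens
        s = Q[t] @ K[keep].T / np.sqrt(d)          # full attention on the subset
        p = np.exp(s - s.max())
        out[t] = (p / p.sum()) @ V[keep]
    return out
```

The indexer pass is still quadratic in n, but over a much smaller dimension; the expensive full-dimension attention touches only top_k tokens per query.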

Mixture of Experts (MoE): Scaling parameters without proportional compute cost

The lecture then shifts to Mixture of Experts (MoE), a technique aimed at increasing model capacity by vastly expanding the number of parameters without proportionally increasing computational cost. Instead of a single, dense feed-forward network (FFN) in the transformer block, an MoE architecture uses multiple smaller FFNs, termed 'experts.' A routing mechanism directs each incoming token to a small subset (typically top-K) of these experts. This means that while the total parameter count can be enormous (e.g., billions), the forward-pass FLOPs per token remain comparable to those of a dense model with far fewer parameters. This parameter-centric view allows models to scale effectively without prohibitive training or inference costs. Many recent high-performance models, such as DeepSeek V2 and Qwen, have adopted MoE architectures, showing improved training speed and performance compared to dense counterparts at similar compute budgets.
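A minimal token-choice MoE layer looks like the sketch below (a simplification: real implementations batch tokens per expert rather than looping, and the experts here are arbitrary callables rather than full FFNs):

```python
import numpy as np

def moe_forward(x, W_router, experts, top_k=2):
    """Token-choice top-K MoE layer (NumPy sketch).

    x: (n_tokens, d) inputs; W_router: (d, n_experts) router weights;
    experts: list of callables, each mapping a (d,) vector to a (d,) vector.
    Each token runs through only top_k experts, so per-token compute scales
    with top_k, not with the total number of experts (parameters)."""
    logits = x @ W_router                            # (n_tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)        # softmax router scores
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(probs[t])[-top_k:]          # this token's chosen experts
        w = probs[t, top] / probs[t, top].sum()      # renormalized gate weights
        for e, we in zip(top, w):
            out[t] += we * experts[e](x[t])          # weighted expert outputs
    return out
```

Note the renormalization of the gate weights over the selected experts: the combined output is a convex combination of just top_k expert outputs, while the remaining experts contribute no FLOPs at all for this token.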

MoE design: Routing, expert balancing, and shared experts

Designing effective MoE models involves several key considerations, primarily the routing mechanism and expert utilization. The most common approach is 'token choice' top-K routing, where each token independently selects the top K experts that will process it; this contrasts with 'expert choice,' where experts select tokens. Simple, lightweight routers, often a linear projection (inner product) between the input and expert embeddings, are typically used. A critical challenge is preventing router collapse or 'expert starvation,' where a few experts become overloaded while others are underutilized. This is addressed by adding auxiliary 'balancing losses' during training, which penalize uneven token distribution across experts. DeepSeek MoE pioneered 'shared experts,' in which a subset of experts is always active for all tokens, bypassing the router, alongside fine-grained routed experts. This design, along with auxiliary losses for expert and device balancing, has become a widely adopted standard, seen in models like Qwen 1.5 MoE and GLM. While shared experts offer less parallelization benefit, they free the routed experts to specialize.

Training MoEs: Heuristics for non-differentiable routing and stability

Training MoEs presents unique challenges due to the non-differentiable nature of top-K routing. While reinforcement learning or stochastic perturbations can in principle be used, practical implementations rely heavily on heuristics, particularly auxiliary balancing losses. These losses encourage an even distribution of tokens across experts and devices, preventing the 'rich get richer' phenomenon where a few popular experts dominate. The loss typically multiplies the fraction of tokens assigned to each expert (f_i) by the router probability mass allocated to it (P_i) and sums over experts, creating pressure to keep both quantities balanced. DeepSeek V3 introduced aux-loss-free balancing methods, but auxiliary losses remain crucial for stability and effective parameter utilization. Softmax in the router can also introduce stability issues, sometimes requiring higher precision (e.g., float32) for the router or techniques like Z-loss, as demonstrated in OLMo's ablations. Fine-tuning MoEs can likewise be challenging: their large parameter count risks overfitting, leading to strategies such as fine-tuning only attention layers or retraining with more data.
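The f_i * P_i balancing loss described above can be computed as follows (a sketch in the style of the Switch Transformer loss; the scaling by the number of experts is one common convention):

```python
import numpy as np

def load_balancing_loss(router_probs, expert_assignments, n_experts):
    """Auxiliary balancing loss: n_experts * sum_i f_i * P_i.

    router_probs: (n_tokens, n_experts) softmax router outputs.
    expert_assignments: (n_tokens,) int array, the expert chosen per token
    (top-1 shown for simplicity).
    f_i = fraction of tokens routed to expert i (hard counts);
    P_i = mean router probability for expert i (soft, differentiable).
    The loss is minimized (value 1) when both are uniform, penalizing
    collapse onto a few popular experts."""
    n_tokens = router_probs.shape[0]
    f = np.bincount(expert_assignments, minlength=n_experts) / n_tokens
    P = router_probs.mean(axis=0)
    return n_experts * float(f @ P)
```

Because the hard counts f_i carry no gradient, the gradient flows through the soft probabilities P_i, nudging the router away from whichever experts are currently over-assigned.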

Common Questions

Q: What is the main obstacle to scaling attention to very long contexts?
A: The primary challenge is the quadratic increase in computational cost with sequence length, making attention mechanisms prohibitively expensive for very long contexts.

Mentioned in this video

Software & Apps
FlashAttention

A systems-level optimization for attention mechanisms that rearranges operations to minimize memory transfer overhead, leading to significant performance improvements.

PyTorch

A deep learning framework used as a baseline for performance comparisons, showing dramatic improvements with FlashAttention.

Mamba

A family of state space models derived from state space theory, with Mamba 2 being an elaboration of linear attention by adding a gating mechanism.

Nemotron 3

A model that alternates lightweight Mamba 2 layers with softmax attention to balance inference cost and expressiveness.

Qwen 3

A model that Nemotron 3 is compared against, showing competitive performance.

GPT OSS

A model that Nemotron 3 is compared against, showing competitive performance.

Gated DeltaNet

A state space model that builds upon Mamba 2 by adding a second gate, beta(t), to control updates, incorporating projection-based updates that can erase stored information.

LSTM

Mentioned as an inspiration for gating mechanisms in state space models, highlighting the importance of controlling information flow.

Qwen Next

A model with improved decoding throughput at large context lengths compared to Qwen 3, utilizing Gated DeltaNet architecture.

GLM 5

A highly regarded open-source model that adopted the DSA approach for efficient attention.

Claude 4.5 Sonnet

A frontier LLM model that DeepSeek V3.2 is compared against.

Gemini 3

A frontier LLM model that DeepSeek V3.2 is compared against.

Switch Transformer

An early Google paper on MoEs that introduced the concept of auxiliary loss for expert balancing.

OLMo

An open-source MoE study that provided ablations on Z loss for router stability and discussed load balancing loss.

H-Nets

Mentioned as an architecture that uses top-K selection and load balancing, similar to DSA.

MegaBlocks

An open-source MoE framework that addresses issues like expert queue overflow, offering dropless architectures.

Qwen

A model family from Alibaba whose team contributed to early MoE popularization and upcycling techniques, demonstrating significant performance gains.

Qwen 1.5

A model resulting from upcycling a smaller Qwen model into a larger MoE, showcasing successful scale-up.
