[Paper Club] Upcycling Large Language Models into Mixture of Experts
Key Moments
Upcycling large language models into Mixture of Experts yields better accuracy at the same compute budget.
Key Insights
Mixture of Experts (MoE) increases model parameters without a proportional increase in compute by sparsely activating experts per token.
Megatron-Core MoE optimizes MoE training and inference through techniques like expert model parallelism, fused operations, and efficient token dispatching.
Upcycling dense LLMs into MoE models can yield better accuracy than further training the dense model, especially on large datasets.
A key to successful upcycling is maintaining similar forward pass behavior to the original dense model to avoid catastrophic forgetting.
The order of Top-K and Softmax operations in MoE routing significantly impacts performance, with a specific scaling factor needed for upcycled models.
Fine-grained MoE benefits from increasing the expert count up to a point, with 64 experts emerging as a sweet spot and diminishing returns beyond it.
INTRODUCTION TO MIXTURE OF EXPERTS
The presentation begins by introducing the concept of Mixture of Experts (MoE) as a method to scale Large Language Models (LLMs). Traditional LLMs utilize an FFN layer comprising two linear layers. MoE transforms this by replacing the FFN with multiple expert FFNs, where a router selectively activates a few experts for each token. This approach allows for a significant increase in model parameters, thereby enhancing knowledge capacity, without a corresponding increase in computational cost (flops), addressing the challenge of limited compute resources.
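The parameter-versus-compute trade-off described above can be made concrete with a little arithmetic. The sizes below are illustrative assumptions, not figures from the talk:

```python
# Hypothetical sizes for illustration (not from the talk).
d_model, d_ff = 4096, 16384
num_experts, top_k = 8, 2

# Dense FFN: two linear layers (up-projection and down-projection).
dense_params = 2 * d_model * d_ff

# MoE: every expert holds a full FFN, but only top_k of them
# actually run for any given token.
moe_params = num_experts * dense_params
active_params = top_k * dense_params  # what one token actually touches

print(moe_params / dense_params)     # 8.0: 8x the parameters...
print(active_params / dense_params)  # 2.0: ...but only 2x per-token flops
```

This is the core appeal: knowledge capacity grows with total parameters, while per-token flops grow only with the number of activated experts.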
MECHANISMS AND CHALLENGES OF MOE
An MoE layer involves three key steps: routing, permutation, and computation. The router, typically a learnable matrix, assigns probabilities to tokens, determining which experts to activate. Permutation involves reordering tokens for efficient expert processing, followed by computation within the selected experts. Finally, an unpermutation step re-aligns tokens to their original order, with router probabilities applied as scaling factors. Scaling MoE models during training poses challenges, including substantial memory pressure due to increased parameters, overhead from router dispatching, activation memory duplication, and potential imbalance issues if tokens are not evenly distributed among experts.
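The four steps above (routing, permutation, expert computation, un-permutation) can be sketched in plain Python. This is a minimal illustration, not Megatron-Core's implementation; the router here is any callable producing one logit per expert, standing in for the learnable routing matrix:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_layer(tokens, router, experts, top_k=2):
    """Route each token to its top_k experts, compute, and recombine."""
    # 1) Routing: probabilities per expert; keep the top_k for each token.
    assignments = []  # (token_index, expert_index, router_probability)
    for i, tok in enumerate(tokens):
        probs = softmax(router(tok))
        top = sorted(range(len(experts)), key=lambda e: -probs[e])[:top_k]
        assignments += [(i, e, probs[e]) for e in top]
    # 2) Permutation: group assignments by expert so each expert
    #    processes its tokens as one contiguous batch.
    outputs = [[0.0] * len(tokens[0]) for _ in tokens]
    for e, expert in enumerate(experts):
        batch = [(i, w) for (i, ei, w) in assignments if ei == e]
        for i, w in batch:
            y = expert(tokens[i])  # 3) computation inside the expert
            # 4) Un-permutation: scatter results back to the original
            #    token order, with router probabilities as scaling factors.
            outputs[i] = [o + w * v for o, v in zip(outputs[i], y)]
    return outputs

# Toy example: 4 experts that scale their input by different factors.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router = lambda tok: [sum(tok) * (e + 1) for e in range(4)]  # toy logits
out = moe_layer([[0.5, 0.5], [1.0, -1.0]], router, experts)
print(len(out))  # 2: one output per token, back in original order
```

In a real system the permutation is a physical reordering of token activations in memory (and across devices), which is exactly where the dispatching overhead discussed later comes from.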
MEGATRON-CORE MOE OPTIMIZATIONS
To address MoE challenges, NVIDIA developed Megatron-Core MoE. This library offers optimizations for both training and inference. Key features include expert model parallelism, which distributes experts across GPUs to save memory and accelerate computation. It also supports various token dispatching strategies, such as token dropping with padding and the alternative of token choice without dropping, catering to different performance and efficiency trade-offs. Fused permutation operations and cutlass group GEMM are implemented to reduce memory overhead and improve computational efficiency, especially for models with a large number of experts.
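The token-dropping dispatch strategy mentioned above can be illustrated with a small sketch. This is a simplified model of the idea, not Megatron-Core code: each expert has a fixed capacity, over-capacity tokens are dropped, and short batches are padded so every expert sees a fixed-size input:

```python
def dispatch_with_capacity(expert_ids, num_experts, capacity):
    """Token-dropping dispatch: each expert keeps at most `capacity`
    tokens (in arrival order); the rest are dropped, and short lists
    are padded with None so every expert sees a fixed-size batch."""
    buckets = [[] for _ in range(num_experts)]
    dropped = []
    for tok, e in enumerate(expert_ids):
        if len(buckets[e]) < capacity:
            buckets[e].append(tok)
        else:
            dropped.append(tok)  # over-capacity token is dropped
    for b in buckets:
        b.extend([None] * (capacity - len(b)))  # pad to a fixed shape
    return buckets, dropped

# Six tokens routed to three experts; expert 0 is oversubscribed.
buckets, dropped = dispatch_with_capacity([0, 0, 0, 1, 1, 2],
                                          num_experts=3, capacity=2)
print(buckets)  # [[0, 1], [3, 4], [5, None]]
print(dropped)  # [2]
```

Fixed-size batches make the computation regular (good for GEMM efficiency) at the cost of dropped tokens and padding waste; the dropless alternative keeps every token but gives experts variable-sized batches.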
ROUTING STRATEGIES IN MOE
The routing mechanism in MoE is crucial for its efficiency. Megatron-Core MoE supports multiple routing and load-balancing schemes, including an auxiliary-loss approach for token-choice routing and Sinkhorn-based routing suited to expert choice. The presentation also contrasts common Top-K routing with alternatives like Top-P sampling, which can introduce more diversity at the cost of added complexity. Expert choice, where experts select tokens rather than tokens selecting experts, is highlighted as particularly suited to vision models, since they have no causal masking to violate.
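The difference between Top-K and Top-P routing comes down to how many experts a token activates. A minimal sketch, using the standard definitions (Top-P here means the smallest set of experts whose cumulative probability reaches p, by analogy with nucleus sampling):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def top_k_route(logits, k):
    """Always activate exactly k experts, the k most probable."""
    probs = softmax(logits)
    return sorted(range(len(probs)), key=lambda i: -probs[i])[:k]

def top_p_route(logits, p):
    """Activate the smallest set of experts whose cumulative
    probability reaches p -- a variable number per token."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    chosen, total = [], 0.0
    for i in order:
        chosen.append(i)
        total += probs[i]
        if total >= p:
            break
    return chosen

logits = [2.0, 1.0, 0.1, 0.1]
print(top_k_route(logits, 2))   # [0, 1]: always two experts
print(top_p_route(logits, 0.5)) # [0]: a confident router needs just one
```

Top-P's variable fan-out is where the extra diversity and the extra systems complexity both come from: batch shapes are no longer fixed per token.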
UPCYCLING DENSE MODELS INTO MOE
A significant contribution discussed is the 'upcycling' of existing dense LLMs into MoE models. This process involves copying the MLP layer into multiple expert copies and randomly initializing a router. A critical technique for stable upcycling, inspired by Mixtral, is swapping the Top-K and Softmax operators. By applying Top-K first, then Softmax over only the selected experts' logits, together with a scaling factor derived from the number of experts and the Top-K value, the upcycled model keeps a forward pass close to the original dense model, preventing catastrophic forgetting.
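Why the operator order matters can be seen numerically. In the sketch below (illustrative values, not from the talk), all experts are identical copies of the dense FFN, as they are right after upcycling. With Softmax-then-Top-K, the selected probabilities sum to well under 1, so the MoE output is a shrunken dense output; with Top-K-then-Softmax, the weights sum to exactly 1 and the dense forward pass is reproduced. (When the Softmax-first order is kept, the talk describes compensating with a scaling factor derived from the number of experts and the Top-K value.)

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

E, K = 8, 2
logits = [0.3, -0.1, 0.8, 0.2, -0.4, 0.5, 0.0, 0.1]  # fresh random router

# Softmax -> Top-K (common order): the K selected probabilities sum to
# well under 1, so the upcycled output is a scaled-down dense output.
probs = softmax(logits)
top = sorted(range(E), key=lambda i: -probs[i])[:K]
scale_soft_first = sum(probs[i] for i in top)

# Top-K -> Softmax (the upcycling-friendly order): softmax over only the
# K selected logits makes the combination weights sum to exactly 1, so
# the MoE forward pass matches the dense model at initialization.
scale_topk_first = sum(softmax([logits[i] for i in top]))

print(scale_soft_first < 1.0)  # True: dense behavior is distorted
print(round(scale_topk_first, 6))  # 1.0: dense behavior is preserved
```

Since every expert computes the same output at initialization, the output scale is all that separates the upcycled model from the dense one, and keeping it at 1 is what prevents catastrophic forgetting.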
PERFORMANCE GAINS AND EXPERIMENTAL RESULTS
Experimental results demonstrate that upcycled MoE models can achieve better accuracy than simply continuing to train the original dense model for the same number of flops. For instance, upcycling to an 8x15B model on one trillion tokens showed a notable improvement in validation loss and on the MMLU benchmark. A key factor for successful upcycling is using a higher learning rate than is typical for fine-tuning; analysis shows that higher learning rates produce lower cosine similarity between the base and upcycled models, indicating more substantial adaptation. The number of experts also matters, with 64 appearing to be a sweet spot and diminishing returns beyond that point.
FINE-GRAINED MOE AND SCALING
The concept of fine-grained MoE is explored, where models use a larger number of experts, each being smaller. This increases representational power by allowing more expert combinations. For example, a model might segment its experts using parameters like E (expansion factor), G (granularity), and T (routing tokens). To maintain forward pass consistency in fine-grained MoE upcycling, especially when experts are sharded, the router weights are initialized to be duplicated. This ensures consistent probability distributions across virtual groups, leading to identical Top-K selections and preserving the dense model's behavior.
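The router-duplication trick for fine-grained upcycling can be sketched directly. This is an illustrative model of the idea, with made-up dimensions: each dense expert is split into G smaller shards, and each dense router weight row is duplicated G times, so every shard in a "virtual group" inherits its parent expert's logit:

```python
import random

random.seed(0)
E, G, d = 4, 2, 3  # dense experts, shards per expert (granularity), dim

# Dense router: one weight row per original expert.
base_rows = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(E)]

# Fine-grained upcycling: duplicate each router row G times so every
# shard in a virtual group shares its parent expert's routing weights.
fine_rows = [row for row in base_rows for _ in range(G)]

x = [0.2, -1.0, 0.7]
logits = [sum(w * v for w, v in zip(row, x)) for row in fine_rows]

# Shards within a group get identical logits, hence identical routing
# probabilities and identical Top-K selections -- which is what keeps
# the fine-grained forward pass consistent with the dense model.
groups = [logits[i * G:(i + 1) * G] for i in range(E)]
print(all(len(set(g)) == 1 for g in groups))  # True: duplicates agree
```

Because each group's shards are always selected (or skipped) together at initialization, summing their outputs recovers the original expert's output, and the dense behavior carries over.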
IMPLICATIONS FOR FUTURE LLM DEVELOPMENT
The upcycling technique offers a path to near state-of-the-art performance without the prohibitive cost of training massive MoE models from scratch. It leverages existing large dense models, providing a significant accuracy boost (e.g., 4-5% on MMLU) with substantially less compute, comparable to a 1.7x larger dense model at roughly an eighth of the pre-training compute. This suggests a more efficient scaling trajectory for future LLM development, emphasizing data efficiency and architectural innovation.
Common Questions
What is a Mixture of Experts (MoE) model?
A Mixture of Experts (MoE) model replaces the Feed-Forward Network (FFN) layer in a traditional Transformer with multiple expert FFNs. Each input token selectively activates a few of these experts, chosen by a router, increasing model size and capability without a proportional increase in compute.
Topics
Mentioned in this video
An open-source library on GitHub that accelerates LLM training and inference, including MoE models.
A large language model mentioned as an example of a large dense model that can be upcycled into MoE.
Mentioned in the context of fine-tuning and comparing cosine similarity with upcycled models.
A core component of Megatron-LM, providing optimized implementations for various LLM architectures and MoE.
A framework providing high-level interfaces for training LLMs, customizable with configurations.
A recent MoE model that uses 64 experts.
A specific MoE model architecture mentioned in the context of upcycling and learning rate experiments.
A variant of Llama mentioned in comparison to Llama base regarding cosine similarity.
A deep learning optimization library that has a MoE variant (DeepSpeed-MoE V2) with a large number of experts.