What are the challenges in scaling MoE models?

Scaling MoE models is challenging due to increased memory usage from parameter growth (e.g., 6-7x for MoE variants), overhead from router dispatching (permute/unpermute operations doubling activation memory), and potential for load imbalance if tokens are routed inefficiently to a single expert.

How does Megatron Core optimize MoE models?

Megatron Core offers optimizations such as expert model parallelism (placing experts on different GPUs), token dispatching with options for token dropping or padding, fused permutation operations to reduce memory overhead, and accelerated grouped GEMM operations via CUTLASS for efficient expert computation.

Can you upcycle existing large dense models into MoE?

Yes, you can upcycle large dense models into MoE by copying the MLP layers to create experts and randomly initializing a router. This process aims to maintain the original model's forward pass to prevent catastrophic forgetting and can achieve better accuracy than further training the dense model for the same FLOPs.

What is the difference between Switch Transformer's routing and the upcycling approach?

The original Switch Transformer uses softmax first, then top-k, ensuring probabilities sum to one. The upcycling approach described uses top-k, then softmax, which can lead to smaller output scales and potential catastrophic forgetting if not handled. A simple scaling of MLP output by (number of experts / top-k) can resolve this.

What are the key findings for upcycling LLMs?

Key findings indicate that a high learning rate is crucial for successful upcycling, ideally using the original peak learning rate from pre-training. Using a constant small learning rate results in high cosine similarity to the base model, suggesting less adaptation compared to a high learning rate which leads to ~0.7 similarity.

What is the optimal number of experts for upcycling?

Experiments suggest that around 64 experts is a sweet spot for MoE upcycling. Increasing the number of experts beyond 64 tends to yield diminishing returns.

How does upcycling compare to continued training of dense models?

Upcycling can provide a 4-5% improvement in metrics like validation loss and MML over continued training of a dense model, while maintaining similar FLOPs. This suggests significant efficiency gains compared to training larger models from scratch, highlighting the importance of data and architecture choices.

Key Moments

[Paper Club] Upcycling Large Language Models into Mixture of Experts

Latent Space Podcast

Science & Technology4 min read40 min video

Oct 29, 2024|570 views|11|4

Save to Pod

Key Moments

TL;DR

Upcycles large language models into Mixture of Experts for better performance with same compute.

Key Insights

Mixture of Experts (MoE) increases model parameters without a proportional increase in compute by sparsely activating experts per token.

Megatron-Core MoE optimizes MoE training and inference through techniques like expert model parallelism, fused operations, and efficient token dispatching.

Upcycling dense LLMs into MoE models can yield better accuracy than further training the dense model, especially on large datasets.

A key to successful upcycling is maintaining similar forward pass behavior to the original dense model to avoid catastrophic forgetting.

The order of Top-K and Softmax operations in MoE routing significantly impacts performance, with a specific scaling factor needed for upcycled models.

Fine-grained MoE with a higher number of experts (e.g., 64) shows diminishing returns beyond a certain point.

INTRODUCTION TO MIXTURE OF EXPERTS

The presentation begins by introducing the concept of Mixture of Experts (MoE) as a method to scale Large Language Models (LLMs). Traditional LLMs utilize an FFN layer comprising two linear layers. MoE transforms this by replacing the FFN with multiple expert FFNs, where a router selectively activates a few experts for each token. This approach allows for a significant increase in model parameters, thereby enhancing knowledge capacity, without a corresponding increase in computational cost (flops), addressing the challenge of limited compute resources.

MECHANISMS AND CHALLENGES OF MOE

An MoE layer involves three key steps: routing, permutation, and computation. The router, typically a learnable matrix, assigns probabilities to tokens, determining which experts to activate. Permutation involves reordering tokens for efficient expert processing, followed by computation within the selected experts. Finally, an unpermutation step re-aligns tokens to their original order, with router probabilities applied as scaling factors. Scaling MoE models during training poses challenges, including substantial memory pressure due to increased parameters, overhead from router dispatching, activation memory duplication, and potential imbalance issues if tokens are not evenly distributed among experts.

MEGATRON-CORE MOE OPTIMIZATIONS

To address MoE challenges, NVIDIA developed Megatron-Core MoE. This library offers optimizations for both training and inference. Key features include expert model parallelism, which distributes experts across GPUs to save memory and accelerate computation. It also supports various token dispatching strategies, such as token dropping with padding and the alternative of token choice without dropping, catering to different performance and efficiency trade-offs. Fused permutation operations and cutlass group GEMM are implemented to reduce memory overhead and improve computational efficiency, especially for models with a large number of experts.

ROUTING STRATEGIES IN MOE

The routing mechanism in MoE is crucial for its efficiency. Megatron-Core MoE offers optimizations like OA and Synor. OA is a token-centric approach, while Synor is designed for expert choice routing. The presentation also discusses the difference between common Top-K routing and alternative strategies like Top-P sampling, which could introduce more diversity but also complexity. Expert choice, where experts select tokens rather than tokens selecting experts, is highlighted as a strategy particularly suited for vision models due to their lack of causal masking.

UPCYCLING DENSE MODELS INTO MOE

A significant contribution discussed is the 'upcycling' of existing dense LLMs into MoE models. This process involves copying the MLP layer into multiple expert copies and randomly initializing a router. A critical technique for stable upcycling, inspired by Mixr, is the swapping of the Top-K and Softmax operators. By applying Top-K first, followed by Softmax on the selected experts' logits, and then applying a scaling factor derived from the number of experts and Top-K value, the upcycled model can maintain similar forward pass behavior to the original dense model, preventing catastrophic forgetting.

PERFORMANCE GAINS AND EXPERIMENTAL RESULTS

Experimental results demonstrate that upcycled MoE models can achieve better accuracy than simply continuing to train the original dense model for the same number of flops. For instance, upcycling an 8x15B model on one trillion tokens showed a notable improvement in validation loss and MML benchmarks. Key factors for successful upcycling include using a higher learning rate than typically used for fine-tuning, and analysis suggests that a higher learning rate leads to lower cosine similarity between the base and upcycled models, indicating significant adaptation. The number of experts, with 64 showing as a sweet spot, also impacts performance, with diminishing returns beyond this point.

FINE-GRAINED MOE AND SCALING

The concept of fine-grained MoE is explored, where models use a larger number of experts, each being smaller. This increases representational power by allowing more expert combinations. For example, a model might segment its experts using parameters like E (expansion factor), G (granularity), and T (routing tokens). To maintain forward pass consistency in fine-grained MoE upcycling, especially when experts are sharded, the router weights are initialized to be duplicated. This ensures consistent probability distributions across virtual groups, leading to identical Top-K selections and preserving the dense model's behavior.

IMPLICATIONS FOR FUTURE LLM DEVELOPMENT

The upcycling technique offers a path to achieving state-of-the-art performance without the prohibitive cost of training massive MoE models from scratch. It leverages existing large dense models, providing a significant boost in accuracy (e.g., 4-5% on MML) with substantially less compute, potentially representing a 1.7x larger model or twice the power for an eighth of the pre-training compute. This suggests a more efficient scaling trajectory for future LLM development, emphasizing data efficiency and architectural innovation.

Mentioned in This Episode

●Software & Apps

●Companies

●Concepts

Common Questions

A Mixture of Experts (MoE) model transforms the Feed-Forward Network (FFN) layer in traditional Transformers into multiple expert FFNs. Each input token then selectively activates a few of these experts, chosen by a router, enhancing model size and capability without a proportional increase in compute.

Topics

Mixture Of Experts AI & Machine Learning Technology & Innovation Programming & Software Large Language Models LLM Training Transformer Architecture Model Optimization Model Upcycling Deep Learning Frameworks

Mentioned in this video

Software & Apps

Megatron LM

An open-source library on GitHub that accelerates LLM training and inference, including MoE models.

Llama

A large language model mentioned as an example of a large dense model that can be upcycled into MoE.

LLaMA 2

Mentioned in the context of fine-tuning and comparing cosine similarity with upcycled models.

Megatron Core

A core component of Megatron-LM, providing optimized implementations for various LLM architectures and MoE.

Nemo

A framework providing high-level interfaces for training LLMs, customizable with configurations.

Koala V2

A recent MoE model that uses 64 experts.

Mixtral 8x7B

A specific MoE model architecture mentioned in the context of upcycling and learning rate experiments.

Llama Chat

A variant of Llama mentioned in comparison to Llama base regarding cosine similarity.

DeepSpeed

A deep learning optimization library that has a MoE variant (DeepSpeed-MoE V2) with a large number of experts.

Organizations

Meta AI

Previously Facebook, conducted research on training dense models first and then combining specialized experts into MoE.

Companies

NVIDIA

The company where the speaker, Ethan, works, involved in developing LLM scaling solutions.

Google

The originators of the Switch Transformer model, which uses softmax followed by top-k for expert selection.

DeepSeek

A model that uses a large number of experts (128 or 160).

Concepts

Switch Transformer

The first model to surpass one trillion model size and an early example of MoE architecture.

Found this useful? Build your knowledge library

Get AI-powered summaries of any YouTube video, podcast, or article in seconds. Save them to your personal pods and access them anytime.

Get Started Free