Mixture of Experts

Core Concepts

Sparse Activation

Only a fraction of the model's total parameters are used to process each token, dramatically reducing compute per forward pass.

How it works

A standard dense transformer activates every parameter for every token. In an MoE model, each transformer block's feed-forward network is replaced by multiple parallel "expert" FFNs. A routing mechanism selects a small subset (typically 1 or 2) of experts per token. Only the selected experts run their computation. Total parameter count can be very large (hundreds of billions), but active parameters per token remain small (often 10 to 25% of the total).
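
The arithmetic behind sparse activation is straightforward. As a quick sanity check, here is the active-parameter fraction computed from Mixtral 8x7B's published figures (which appear later in this document):

```python
# Active-parameter arithmetic for a top-2-of-8 MoE model,
# using Mixtral 8x7B's published figures as an example.
total_params = 46.7e9    # all experts counted
active_params = 12.9e9   # shared layers + 2 of 8 expert FFNs per token

fraction_active = active_params / total_params
print(f"{fraction_active:.1%} of parameters active per token")  # ~27.6%
```

The fraction is well above 2/8 because attention layers, embeddings, and other shared components are always active; only the expert FFNs are sparsely used.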

Why it matters

Sparse activation breaks the link between model capacity and compute cost. A model with 8x more parameters than a dense model can run at similar FLOPs per token, because each token only uses a fraction of the parameters. This allows models to store more knowledge without proportionally increasing inference cost.

Expert Networks

Individual feed-forward networks that learn to specialise in different types of inputs during training.

How it works

Each expert is a standard two-layer FFN (identical architecture to the FFN in a dense transformer). During training, experts naturally specialise because the router learns to send similar tokens to the same expert. Different experts may learn to handle different languages, topics, syntactic structures, or reasoning patterns. Experts share the same attention layers, and only the FFN portion is replicated.

Why it matters

Specialisation means each expert can develop deep competence in its domain rather than being a generalist. This is analogous to how organisations use specialist teams: the total capacity is greater than any single person, and each query is routed to the most relevant specialist.

Gating / Router

A learned function that decides which experts process each token.

How it works

The router is typically a small linear layer that takes a token's hidden representation and produces a score for each expert. These scores are normalised (via softmax) to produce routing probabilities. The top-K experts (by probability) are selected. The expert outputs are combined as a weighted sum using the routing probabilities as weights. The router is trained end-to-end with the rest of the model.
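
The steps above can be sketched in a few lines. This is a minimal illustration, not any particular model's implementation; the function and variable names are hypothetical, and the router weights here are random for demonstration:

```python
import numpy as np

def route(hidden, w_router, k=2):
    """Minimal top-k router sketch.
    hidden: (d,) token representation; w_router: (d, n_experts)."""
    logits = hidden @ w_router                   # one score per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                         # softmax over experts
    top_k = np.argsort(probs)[-k:][::-1]         # indices of the k best experts
    weights = probs[top_k] / probs[top_k].sum()  # renormalise over the chosen k
    return top_k, weights

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 8))                 # 16-dim hidden, 8 experts
idx, wts = route(rng.standard_normal(16), w)
```

The selected experts' outputs would then be combined as `sum(wts[i] * expert[idx[i]](hidden))`. Renormalising the top-k probabilities so they sum to one is a common design choice; some models instead use the raw softmax values as weights.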

Why it matters

The router is the critical component that makes MoE work. A good router ensures tokens reach the most relevant experts while distributing load evenly. Poor routing leads to expert collapse (all tokens go to one expert) or load imbalance (some experts are overwhelmed while others are idle).

Routing Mechanisms

Top-K Token Choice

The standard routing approach where each token selects its top-K experts.

How it works

For each token, the router produces scores for all experts. The top-K experts (usually K=1 or K=2) are selected. The token is processed by each selected expert independently. The outputs are combined as a weighted sum using the normalised routing scores. With top-2 routing, each token gets two "opinions" that are blended. An auxiliary load-balancing loss encourages even distribution across experts.

Trade-offs

Simple and widely used. Top-2 provides better quality than top-1 at roughly double the compute. The main challenge is load balancing: without auxiliary losses, the router tends to converge on sending most tokens to a small number of "popular" experts, wasting capacity.

Expert Choice

Inverts the routing direction so experts select their tokens rather than tokens selecting experts.

How it works

Instead of each token choosing its top-K experts, each expert chooses its top-K tokens from the current batch. This guarantees perfect load balance because every expert processes exactly K tokens. The expert scores all tokens in the batch and selects the highest-scoring ones. A token may be selected by zero, one, or multiple experts.
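
The inversion is easy to see in code. A minimal sketch, with hypothetical names and random scores standing in for router outputs:

```python
import numpy as np

def expert_choice(scores, k):
    """Each expert (row) picks its top-k tokens (columns).
    scores: (n_experts, n_tokens) router affinity matrix."""
    # argsort each row ascending, then take the k highest-scoring token indices
    return np.argsort(scores, axis=1)[:, -k:]

rng = np.random.default_rng(0)
assignment = expert_choice(rng.standard_normal((4, 32)), k=8)
# assignment has shape (4, 8): every expert gets exactly k tokens,
# so load balance is perfect by construction. Some of the 32 tokens
# may appear in several rows (redundant) or in none (dropped).
```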

Trade-offs

Guarantees perfect load balance, eliminating the need for auxiliary balancing losses. However, some tokens may not be selected by any expert (dropped) or may be selected by many experts (redundant computation). Works best with large batch sizes where experts have many tokens to choose from.

Hash / Deterministic Routing

Assigns tokens to experts using fixed rules rather than learned routing.

How it works

Tokens are assigned to experts by a fixed rule: the token's position, a hash of its token ID, or some other deterministic function. There are no learned router parameters, and the assignment is the same regardless of context. Random routing assigns tokens uniformly at random; modulo routing assigns the token at position i to expert i mod E, where E is the number of experts.
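
Hash routing in particular fits in a few lines. A sketch, assuming a SHA-256 hash of the token ID (any stable hash works):

```python
import hashlib

def hash_route(token_id, n_experts):
    """Deterministic routing: hash the token ID, take it modulo n_experts.
    The same token ID always maps to the same expert, in any context."""
    digest = hashlib.sha256(str(token_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_experts
```

Python's built-in `hash()` is avoided here because it is salted per process, which would break the determinism the technique relies on.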

Trade-offs

Zero routing overhead and perfect load balance. However, the assignment does not adapt to content, so a token about mathematics gets the same expert regardless of whether that expert has specialised in math. Generally underperforms learned routing but is useful as a baseline and in very large-scale systems where routing overhead matters.

Training Challenges

Load Balancing

Ensuring all experts receive a roughly equal share of tokens during training.

The problem

The router tends to develop preferences for certain experts early in training. Once an expert gets more tokens, it improves faster, which makes the router send it even more tokens. This creates a positive feedback loop where a few experts do all the work while the rest are starved of training signal and become useless.

Solutions

Auxiliary load-balancing loss adds a penalty proportional to how unevenly tokens are distributed across experts. This loss is weighted (typically 0.01 to 0.1) and added to the main language modelling loss. Expert capacity factors cap the maximum number of tokens any single expert can receive. Batch-level balancing ensures each batch distributes tokens roughly evenly.
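
The auxiliary loss from the Switch Transformer paper is a common concrete form of this penalty: the product of the fraction of tokens dispatched to each expert and the mean router probability for that expert, scaled by the number of experts. A sketch for top-1 routing:

```python
import numpy as np

def load_balance_loss(router_probs, expert_index):
    """Switch-Transformer-style auxiliary loss sketch.
    router_probs: (n_tokens, n_experts) softmax outputs
    expert_index: (n_tokens,) chosen expert per token (top-1)."""
    n_tokens, n_experts = router_probs.shape
    # f_i: fraction of tokens actually dispatched to expert i
    f = np.bincount(expert_index, minlength=n_experts) / n_tokens
    # P_i: mean router probability assigned to expert i
    p = router_probs.mean(axis=0)
    # minimised (value 1.0) when both are uniform at 1/n_experts
    return n_experts * np.sum(f * p)

probs = np.full((8, 4), 0.25)                   # perfectly uniform routing
idx = np.array([0, 1, 2, 3, 0, 1, 2, 3])
loss = load_balance_loss(probs, idx)            # -> 1.0 in the balanced case
```

In training, this value (minus its floor of 1.0, or as-is) is multiplied by a small coefficient such as 0.01 and added to the language-modelling loss.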

Expert Collapse

A failure mode where the model converges to using only a small subset of experts, effectively reducing to a smaller dense model.

The problem

If the router learns to consistently ignore certain experts, those experts stop receiving gradient updates and their weights stagnate. The router then has even less reason to use them, creating irreversible collapse. In severe cases, a model with 64 experts may effectively use only 4 to 8, wasting most of its parameters.

Solutions

Jitter noise adds small random perturbations to router logits during training, giving "cold" experts occasional tokens to learn from. Expert dropout randomly drops experts during training, forcing the model to not over-rely on any single expert. Periodic expert re-initialisation resets unused experts' weights to give them a fresh start.
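
Jitter is the simplest of these to illustrate. A sketch with additive uniform noise on the logits, matching the description above (some implementations, such as the Switch Transformer, instead multiply the router input by noise in [1-eps, 1+eps]; the idea is the same):

```python
import numpy as np

def jittered(logits, eps=1e-2, rng=None):
    """Add small uniform noise to router logits during training, so experts
    near the selection boundary occasionally win and keep receiving gradient."""
    if rng is None:
        rng = np.random.default_rng()
    return logits + rng.uniform(-eps, eps, size=logits.shape)
```

The noise is applied only during training; at inference time the raw logits are used so routing stays deterministic.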

Communication Overhead

Moving tokens between devices when experts are distributed across multiple accelerators.

The problem

In large MoE models, experts are spread across different GPUs or TPUs. When a token needs to reach an expert on another device, it must be transferred across the interconnect. With thousands of experts and thousands of tokens per batch, this creates an all-to-all communication pattern that can become a bottleneck, especially when interconnect bandwidth is limited.

Solutions

Expert parallelism places groups of experts on the same device and routes tokens accordingly. Capacity factors limit how many tokens cross device boundaries. Hierarchical routing first selects a device, then selects an expert on that device, reducing cross-device traffic. Efficient all-to-all collectives optimise the communication pattern.
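
Hierarchical routing can be sketched as two cascaded argmax decisions. This is an illustration of the idea only; the names and weight shapes are hypothetical:

```python
import numpy as np

def hierarchical_route(hidden, w_device, w_expert):
    """Two-stage routing sketch: pick a device first, then an expert hosted
    on that device, so each token crosses at most one device boundary.
    w_device: (d, n_devices); w_expert: (n_devices, d, experts_per_device)."""
    device = int(np.argmax(hidden @ w_device))           # coarse: which device
    expert = int(np.argmax(hidden @ w_expert[device]))   # fine: which local expert
    return device, expert
```

The first stage reduces the all-to-all pattern to device granularity: only the device choice determines network traffic, while the expert choice is resolved locally.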

Notable MoE Models

Switch Transformer

Google Research (2021). Simplified MoE by using top-1 routing, showing that a single expert per token is sufficient.

Architecture details

Replaces the FFN in each transformer layer with a Switch layer containing many experts (up to 2048). Each token is routed to exactly one expert (top-1). Uses a simplified load-balancing auxiliary loss. Scales to over 1 trillion parameters. Trained on C4 dataset.

Significance

Demonstrated that MoE can scale efficiently and that top-1 routing works well despite its simplicity. Showed 4 to 7x pre-training speedups over dense models with the same compute budget. Established practical engineering patterns for training MoE models at scale.

Mixtral 8x7B

Mistral AI (2023). A high-quality open-weights MoE model that matched or exceeded much larger dense models.

Architecture details

8 experts per MoE layer, top-2 routing (2 of 8 experts active per token). 46.7B total parameters, 12.9B active parameters per token. 32K token context window. Shares the Mistral 7B attention layers and overall layout, but replaces each FFN with an MoE layer.

Significance

Proved that MoE is practical for open-source deployment. Matched GPT-3.5 and LLaMA 2 70B performance while using only 12.9B active parameters per forward pass. Demonstrated that the MoE approach works for production-quality instruction-tuned models, not just research experiments.

DeepSeek-V2 / V3

DeepSeek (2024). Pushed MoE design further with fine-grained experts and innovative architectural choices.

Architecture details

DeepSeek-V2 uses 160 fine-grained experts with top-6 routing, plus 2 shared experts that process every token. Multi-head Latent Attention (MLA) compresses key-value caches for efficient inference. DeepSeek-V3 scales to 671B total parameters (37B active) with 256 routed experts per layer. Uses auxiliary-loss-free load balancing via bias terms.
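
The shared-plus-routed design combines an always-on path with a gated path. A minimal sketch (not DeepSeek's implementation; experts here are arbitrary callables and the router weights are placeholders):

```python
import numpy as np

def moe_with_shared(hidden, shared_experts, routed_experts, router_w, k):
    """Shared experts always run; routed experts are selected top-k.
    Each 'expert' here is simply a callable on the hidden state."""
    out = sum(e(hidden) for e in shared_experts)     # common-knowledge path
    logits = hidden @ router_w
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                             # softmax over routed experts
    top = np.argsort(probs)[-k:]                     # top-k routed experts
    w = probs[top] / probs[top].sum()                # renormalised gate weights
    return out + sum(wi * routed_experts[i](hidden) for wi, i in zip(w, top))

# With identity experts, output = (n_shared + 1) * hidden, since the
# routed weights sum to one: 2 shared + gated identity -> 3 * hidden.
identity = lambda x: x
out = moe_with_shared(np.ones(4), [identity] * 2, [identity] * 8,
                      np.zeros((4, 8)), k=2)
```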

Significance

Showed that many small experts (fine-grained) can outperform fewer large experts. The shared expert design ensures a baseline of common knowledge while routed experts handle specialisation. Achieved frontier model performance at a fraction of the training cost compared to dense alternatives.

Grok-1

xAI (2024). One of the largest open-weights MoE models, released as open source.

Architecture details

314B total parameters. Uses a mixture of 8 experts with top-2 routing. 64 transformer layers. Trained on a large proprietary dataset. Released with open weights under the Apache 2.0 licence.

Significance

Notable for its scale and open release. Demonstrated that very large MoE models can be made available to the research community. Provided a reference point for MoE architectures at the 300B+ parameter scale.

See Sliding Window Attention and Multi-Head Latent Attention (MLA) on the Architectures page.