Here’s what nobody tells you about MoE: the router can single-handedly destroy your model. You can have a perfect expert architecture, tuned hyperparameters, and unlimited compute, but if your router collapses, you’re back to dense-model performance no matter how many experts you add.
The router’s job sounds simple – it decides which expert handles each token. In practice, it’s where most MoE implementations go wrong. Pick the wrong strategy and you can spend weeks debugging and still be completely lost.
So which routing strategy should you use, and what should you expect from it? Let’s examine the most common approaches, their real-world tradeoffs, and what works in practice.
The Routing Landscape: Oh So Many Flavors…
Let’s first address the elephant in the room: why should you care about routing techniques from 2017-2022 when dozens of newer methods are published every week? Because every single production MoE model today is built on top of them!
In Table 1 you’ll see fancy names like shared experts, capacity factors, adaptive auxiliary (aux) loss, or expert bias (and there are many more out there!) – these are just engineering tricks layered on top of core methods developed almost a decade ago. What are they trying to fix? Two fundamental problems: expert utilization (are all your experts actually being used?) and expert specialization (are your experts learning different things, or just copying each other and introducing redundancy?).
What about DeepSeek-V3’s novel routing method? It’s vanilla learned routing with an aux loss at the sequence level, plus extra engineering tricks to improve expert utilization. Qwen3’s routing breakthrough? Also learned routing with an aux loss, but applied at the global-batch level – in other words, it relaxes the load-balancing regularization a bit more to make experts more specialized.
Want to pick something off the shelf? Go ahead, use Table 1 and close this guide. But when your shiny new routing method fails at 3 a.m. during a multi-million-dollar training run, you’ll be debugging one of the core approaches underneath all those engineering layers – the approaches we explore in greater depth in the rest of this guide.
The Three Fundamental Approaches
Hash Routing: The Safe but Boring Choice
Hash routing (Roller et al. 2021) is the most straightforward approach – the router from moe_layer introduced in (Soboleva 2025) simply assigns tokens using:

$$\text{expert\_id} = \mathrm{hash}(\text{token\_id}) \bmod N$$

where N is the expert count and token_id is the token’s index in the vocabulary. It’s deterministic, easy to understand, and impossible to break. It also doesn’t work very well.
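To make that concrete, here is a minimal sketch of a hash router in PyTorch. The function name and the multiplicative hash constant are mine, chosen for illustration; the key property is that the assignment never looks at the token’s context.

```python
import torch

def hash_route(token_ids: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Assign each token to an expert purely from its vocabulary index.

    token_ids: (batch, seq_len) integer tensor of vocabulary indices.
    Returns:   (batch, seq_len) integer tensor of expert indices.
    """
    # A cheap fixed "hash": multiply by a large odd constant, then take the
    # modulo. Any deterministic hash works - the point is that the same
    # token_id always lands on the same expert, regardless of context.
    return (token_ids * 2654435761) % num_experts

token_ids = torch.tensor([[17, 42, 17, 1033]])
print(hash_route(token_ids, num_experts=8))  # tensor([[1, 2, 1, 1]])
```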
Looking at Figure 2a, hash routing maintains perfect load balancing across all layers – every expert gets the same number of tokens (high expert utilization). But Figure 3a shows why this doesn’t help: experts end up learning overlapping, similar representations because token assignments completely disregard the token’s context (low expert specialization). The token “function” in code and “function” in a math paper maps to the same token_id – and therefore the same expert – yet needs completely different processing. Hash routing can’t tell the difference.
As a result, with 16 experts hash routing gives you only a 1.5% loss improvement (compared to Chinchilla-optimal dense scaling (Hoffmann et al. 2022) at fixed compute), and the gain barely grows with more experts (Figure 1).
Learned Routing: The Industry Standard
With hash routing the problem is clear: ignoring context kills performance. Learned routing, first introduced in (Shazeer et al. 2017), takes the opposite approach – it learns which expert should handle each token. Concretely, the router from moe_layer is now a learned linear layer that outputs logits for each expert. To penalize the router for potential imbalances, we add an auxiliary loss:

$$\mathcal{L}_{\text{aux}} = \text{coeff} \cdot N \cdot \sum_{i=1}^{N} f_i \, p_i$$

where f_i is the fraction of tokens sent to expert i, p_i is the average routing probability the router assigns to expert i, and coeff controls how hard you want to enforce balance.
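Here is a minimal sketch of a learned top-1 router with this auxiliary loss in PyTorch. The class and argument names (LearnedRouter, aux_coeff) are illustrative, not taken from any production implementation in Table 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedRouter(nn.Module):
    def __init__(self, d_model: int, num_experts: int, aux_coeff: float = 1e-2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.num_experts = num_experts
        self.aux_coeff = aux_coeff

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, d_model)
        logits = self.gate(x)                      # (num_tokens, num_experts)
        probs = F.softmax(logits, dim=-1)
        expert_idx = probs.argmax(dim=-1)          # top-1 expert per token

        # f: fraction of tokens routed to each expert.
        f = F.one_hot(expert_idx, self.num_experts).float().mean(dim=0)
        # p: mean routing probability assigned to each expert.
        p = probs.mean(dim=0)
        # The product f * p is minimized when both are uniform, so the loss
        # pushes the router toward balanced expert usage.
        aux_loss = self.aux_coeff * self.num_experts * torch.sum(f * p)

        return expert_idx, probs, aux_loss
```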
The results are impressive. With 16 experts, learned routing delivers a solid 4% loss improvement – nearly 3x larger gains than hash routing (Figure 1)! This is why every production MoE system uses some variant of learned routing (Table 1).
The magic happens through specialization. Figure 3b shows learned routing creating clean, separated expert representations – each expert carves out its own specialty instead of producing overlapping patterns like in hash routing.
But there is a problem: router collapse! Figure 2b shows that while middle layers balance well, early and late layers funnel most tokens to just 1-2 experts. This creates load-balancing nightmares for distributed training (for example, when using expert parallelism (DeepSeek-AI et al. 2024)). This is why DeepSeek-V3 (DeepSeek-AI et al. 2024) and Qwen2 (Qwen et al. 2024) MoE models use shared experts (always-activated experts that run alongside the routed ones).
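Schematically, a shared expert is simply applied to every token and its output is added to the routed experts’ output. A rough sketch (illustrative only, not DeepSeek-V3’s or Qwen2’s actual code), reusing the LearnedRouter above:

```python
import torch

def moe_forward(x, shared_expert, routed_experts, router):
    # x: (num_tokens, d_model); routed_experts: list of expert modules.
    expert_idx, probs, aux_loss = router(x)
    routed_out = torch.stack(
        [routed_experts[int(e)](t) for t, e in zip(x, expert_idx)]
    )
    gate = probs.gather(1, expert_idx.unsqueeze(1))  # (num_tokens, 1)
    # The shared expert sees every token, absorbing common patterns so the
    # routed experts are free to specialize.
    return shared_expert(x) + gate * routed_out, aux_loss
```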
Sinkhorn Routing: The Per-Layer Load Balancer
Learned routing delivers great performance but suffers from router collapse in some layers. Hash routing has perfect load balancing across all layers but ignores context entirely, leading to suboptimal performance. What if you could get learned-routing-level performance with better per-layer load-balancing control? This is where Sinkhorn routing comes in (Clark et al. 2022).
Learned routing controls load balancing globally across layers – but individual layers can still collapse if others compensate. Sinkhorn gets rid of the auxiliary loss altogether and prevents imbalance by regularizing each layer independently. It iteratively alternates two normalizations over the routing matrix π, with one row per token and one column per expert (entries π_{t,e} = exp(logit_{t,e})):

first, it ensures equal load per expert:

$$\pi_{t,e} \leftarrow \frac{\pi_{t,e}}{\sum_{t'} \pi_{t',e}}$$

then it normalizes the distribution for each token:

$$\pi_{t,e} \leftarrow \frac{\pi_{t,e}}{\sum_{e'} \pi_{t,e'}}$$
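In code, the alternating normalization might look like the sketch below (the iteration count is illustrative; implementations such as the one in (Clark et al. 2022) differ in details):

```python
import torch

def sinkhorn(logits: torch.Tensor, n_iters: int = 10) -> torch.Tensor:
    """Balance a (num_tokens, num_experts) routing matrix.

    Alternates expert-wise and token-wise normalization so every expert
    ends up receiving roughly the same total routing mass.
    """
    pi = torch.exp(logits)
    for _ in range(n_iters):
        # Normalize over tokens: every expert gets equal total mass.
        pi = pi / pi.sum(dim=0, keepdim=True)
        # Normalize over experts: each token's weights form a distribution.
        pi = pi / pi.sum(dim=1, keepdim=True)
    return pi
```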
As a result, we can achieve hash-level load balance across all layers (Figure 2c) with learned-routing-level quality (Figure 1).
But here’s an important insight: better load balancing ≠ better learning. Figure 3c shows that enforcing strict per-layer balance limits how much experts can differentiate themselves compared to learned routing’s cleaner separation (Figure 3b). Sinkhorn essentially takes collapsed layers (where only a few experts are effectively utilized) and forcibly moves tokens from overutilized experts to underutilized ones. You’re not getting better token-expert matching – you’re just solving a load-balancing problem.
You might wonder, then, why Sinkhorn isn’t the industry standard if it combines the best parts of learned and hash routing. Unfortunately, it is significantly harder to scale in practice and has seen limited industry adoption due to implementation complexities. Here’s an important one: Sinkhorn’s iterative algorithm detaches gradients, breaking router training.
The fix is surprisingly simple but somewhat buried in the literature: use the (detached) Sinkhorn weights to select which experts to route to, but compute each expert’s mixing weight from the original, non-detached logits. Most people miss this and wonder why their router only learns how to load balance.
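A sketch of what that fix might look like, continuing the sinkhorn function above (top-k routing with k=1 for simplicity; the names are mine):

```python
import torch
import torch.nn.functional as F

def sinkhorn_route(logits: torch.Tensor, k: int = 1):
    # Run Sinkhorn without tracking gradients: its output is only used to
    # decide *which* experts each token goes to.
    with torch.no_grad():
        balanced = sinkhorn(logits)               # from the sketch above
    topk_idx = balanced.topk(k, dim=-1).indices   # (num_tokens, k)

    # Mixing weights come from the original logits, so gradients flow back
    # into the router instead of being cut off by the Sinkhorn iteration.
    probs = F.softmax(logits, dim=-1)
    weights = probs.gather(-1, topk_idx)
    return topk_idx, weights
```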
Finding a routing mechanism that simultaneously improves both expert utilization and specialization remains an important open research challenge (Chi et al. 2022; Qiu et al. 2025). Router inefficiencies become even more pronounced with large numbers of experts (hundreds or thousands). The router is the core component of the MoE system: if it collapses, the scaling advantages of MoE can vanish entirely.
Knowing which router to pick is just the beginning. Even with the right choice, MoE training is fragile. Router collapse, load imbalance, vanishing gradients, and other mysterious training instabilities can appear even in implementations that look correct. Your loss curve keeps going down, but your router learns to route everything to a single expert, and you end up with your baseline dense model despite the increased capacity.
Sound familiar? In “Debugging Dead MoE Models: A Step-by-Step Guide”, we’ll build a complete MoE model from scratch and debug these issues step-by-step together. You will learn how to fix subtle bugs that make MoE training so much harder than training dense models.
Questions? Find me at: https://soboleva-daria.github.io/
Footnotes
[1] Why different projections in Fig 3a vs 3b? Hash lacks routing weights, so we use PCA. Learned routing has router weights that define decision boundaries.
References
Chi, Zewen, Li Dong, Shaohan Huang, et al. 2022. “On the Representation Collapse of Sparse Mixture of Experts.” arXiv Preprint arXiv:2204.09179.
Clark, Aidan, Diego de las Casas, Aurelia Guy, et al. 2022. “Unified Scaling Laws for Routed Language Models.” arXiv Preprint arXiv:2202.01169.
DeepSeek-AI, Aixin Liu, Bei Feng, et al. 2024. “DeepSeek-V3 Technical Report.”
Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, et al. 2022. “Training Compute-Optimal Large Language Models.”
Qiu, Zihan, Zeyu Huang, Bo Zheng, et al. 2025. “Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models.”
Qwen, An Yang, Baosong Yang, et al. 2024. “Qwen2.5 Technical Report.”
Roller, Stephen, Sainbayar Sukhbaatar, Arthur Szlam, and Jason Weston. 2021. “Hash Layers For Large Sparse Models.”
Shazeer, Noam, Azalia Mirhoseini, Krzysztof Maziarz, et al. 2017. “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.” arXiv Preprint arXiv:1701.06538.
Soboleva, Daria. 2025. MoE Fundamentals: Sparse Models Are the Future. Cerebras Blog. https://www.cerebras.ai/blog/moe-guide-why-moe.