
Mixture of Experts (MoE) Routing Visualizer

Modern LLMs like Mixtral and DeepSeek-V3 replace the dense feed-forward block with a router that sends each token to only $K$ of $N$ experts. Watch a single token get routed step by step, then stream a batch through and see what happens when one expert hogs the traffic.

Section 1 · Why MoE

From a dense FFN to a sparse mixture

A transformer's feed-forward block is typically the largest share of per-token compute and parameters. MoE keeps per-token compute small by activating only $K$ experts, while letting the model's total parameter count grow with $N$. The router is a tiny linear layer; the experts are full-size FFNs.

Dense FFN: token $x$ → FFN ($h_{ff}$ hidden units, $\approx 2 \cdot d \cdot h_{ff}$ params for a classic two-matrix FFN) → output $y$. Every token passes through the same FFN. Activated params = total params.

MoE Layer: token $x$ → Router ($W_r \in \mathbb{R}^{N\times d}$, top-$K$) → experts $E_1$, $E_2$, $E_3$, $E_4$ (solid edge = chosen by top-$K$) → $y = \sum_{i \in \text{top-}K} w_i \cdot E_i(x)$. Each token activates only $K$ experts. Activated expert params $\approx \tfrac{K}{N} \cdot$ total expert params.

Compute calculator
Compare an MoE layer to its dense baseline at the same per-expert hidden size. The calculator reports three values: dense FFN total params, MoE total params, and params activated per token.

Mixtral 8×7B uses $N=8, K=2$: ~47B total params but only ~13B activated per token. Param counts here use the SwiGLU-style estimate $3 \cdot d \cdot h_{ff}$ per FFN.
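The calculator's three numbers can be reproduced with a few lines of arithmetic. The sketch below assumes the $3 \cdot d \cdot h_{ff}$ SwiGLU-style estimate stated above, ignores biases, and counts only one layer's FFN and router. The function names and the sizes `d=4096`, `h_ff=14336` are illustrative choices (roughly Mixtral-scale), not values read from the visualizer.

```python
def ffn_params(d: int, h_ff: int) -> int:
    """SwiGLU-style FFN estimate: gate, up, and down projections, no biases."""
    return 3 * d * h_ff


def moe_layer_params(d: int, h_ff: int, n_experts: int, top_k: int) -> dict:
    """Parameter counts for one MoE layer versus its dense baseline."""
    per_expert = ffn_params(d, h_ff)
    router = n_experts * d  # W_r has shape N x d
    return {
        "dense_total": per_expert,
        "moe_total": n_experts * per_expert + router,
        "activated_per_token": top_k * per_expert + router,
    }


# Illustrative sizes (assumed): d=4096, h_ff=14336, N=8, K=2.
counts = moe_layer_params(d=4096, h_ff=14336, n_experts=8, top_k=2)
for name, value in counts.items():
    print(f"{name:>20}: {value / 1e6:,.1f}M")
```

Within a single layer the activated share is close to $K/N$. Whole-model figures such as 13B of 47B come out higher than $K/N$ because attention and embedding parameters are shared by every token and are always active.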

Section 2 · The router, one token at a time

How the router decides

The router is a single linear layer $W_r \in \mathbb{R}^{N \times d}$. For a token $x$ it produces $N$ logits, softmaxes them, picks the top-$K$, and renormalizes the kept weights so they sum to 1. Step through it below.
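The four operations in that paragraph (logits, softmax, top-$K$, renormalize) fit in one short function. This is a minimal pure-Python sketch, not the visualizer's code: `route`, the random token, and the random $W_r$ are all made up for illustration.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route(x, w_r, top_k):
    """Return (expert indices, renormalized weights) for one token x."""
    # Logits: one dot product per expert row of W_r (shape N x d).
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in w_r]
    probs = softmax(logits)
    # Top-K: indices of the K largest probabilities, largest first.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept = sum(probs[i] for i in chosen)
    # Renormalize so the kept weights sum to 1.
    weights = [probs[i] / kept for i in chosen]
    return chosen, weights


# Hypothetical token and router weights, seeded for repeatability.
rng = random.Random(0)
d, n_experts, top_k = 8, 4, 2
x = [rng.gauss(0, 1) for _ in range(d)]
w_r = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n_experts)]

experts, weights = route(x, w_r, top_k)
print("chosen experts:", experts)
print("weights:", [round(w, 3) for w in weights])
```

The layer output would then be the weighted sum of the chosen experts' outputs, using `weights` as the $w_i$.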

Step 0 of 5 (input): a new token vector is sampled. It is the input to the router.
Panels: token vector $x$; router weights $W_r$ (preview, $N \times d$ shown, cell shade = magnitude); logits / probabilities / weights; experts; output $y$ (shown at step 5).
Why renormalize?

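After top-$K$ selection, the kept softmax probabilities no longer sum to 1. Used directly, they would shrink the layer's output by the kept probability mass, on top of weighting the experts against each other. Renormalizing means "given the experts I picked, here is the relative confidence between them", and it makes the layer output a weighted average of the chosen experts' outputs, on the same scale as a single FFN's output. Mixtral renormalizes by taking the softmax over only the top-$K$ logits, which is equivalent. Switch Transformer routes with $K=1$ and does not renormalize: with a single expert the renormalized weight would always be 1, so it scales the expert output by the raw router probability, which also keeps the router trainable.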
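A small numeric check makes the effect concrete. The sketch below uses made-up logits for $N=4$, $K=2$ and compares two ways of getting the final weights: dividing the kept softmax probabilities by their sum, and taking a softmax over the chosen logits only. The two agree because the full-softmax denominator cancels in the ratio.

```python
import math

logits = [2.0, 1.5, 0.3, -1.0]  # made-up router logits for N = 4
top_k = 2

exps = [math.exp(v) for v in logits]
probs = [e / sum(exps) for e in exps]

chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
kept_mass = sum(probs[i] for i in chosen)  # strictly below 1 when K < N

# Option A: renormalize the kept softmax probabilities.
renormalized = [probs[i] / kept_mass for i in chosen]

# Option B: softmax over the chosen logits only.
chosen_exps = [math.exp(logits[i]) for i in chosen]
softmax_over_chosen = [e / sum(chosen_exps) for e in chosen_exps]

print(f"kept mass before renormalizing: {kept_mass:.3f}")
print("renormalized:", [round(w, 3) for w in renormalized])
print("softmax over chosen logits:", [round(w, 3) for w in softmax_over_chosen])
```

Without renormalization, the output would be scaled by `kept_mass`, a factor that changes from token to token.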

Section 3 · Batch routing & load balance

Why load balancing matters

Stream a batch of tokens through the router and watch the per-expert load fill up. A well-trained router spreads tokens roughly evenly across experts. An untrained or collapsed router sends almost everything to one expert, so you pay for $N$ experts but get the throughput of one. With a finite capacity factor, each expert accepts only a fixed number of tokens per batch, and tokens that overflow their expert are dropped.
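Capacity-limited dispatch can be sketched as a simple counter per expert. The code below assumes the capacity rule used by the load panel, $\lceil \text{capacity\_factor} \times T \cdot K / N \rceil$ with $T$ tokens in the batch, and fills experts in arrival order. `sample_assignments` is a hypothetical stand-in for a real router: it just picks expert 0 first with a chosen probability so that balanced and collapsed behavior can be compared.

```python
import math
import random


def dispatch(assignments, n_experts, top_k, capacity_factor):
    """Fill experts in arrival order; slots past capacity are dropped.

    assignments: one list of top_k expert indices per token.
    Returns (capacity, per-expert accepted counts, dropped slot count).
    """
    n_tokens = len(assignments)
    capacity = math.ceil(capacity_factor * n_tokens * top_k / n_experts)
    load = [0] * n_experts
    dropped = 0
    for experts in assignments:
        for e in experts:
            if load[e] < capacity:
                load[e] += 1
            else:
                dropped += 1
    return capacity, load, dropped


def sample_assignments(n_tokens, n_experts, top_k, favorite_prob, rng):
    """Hypothetical router: expert 0 is picked first with probability
    favorite_prob; otherwise experts are drawn uniformly, no repeats."""
    out = []
    for _ in range(n_tokens):
        if rng.random() < favorite_prob:
            others = rng.sample(range(1, n_experts), top_k - 1)
            out.append([0] + others)
        else:
            out.append(rng.sample(range(n_experts), top_k))
    return out


rng = random.Random(0)
for label, favorite_prob in [("balanced", 0.0), ("collapsed", 0.95)]:
    routes = sample_assignments(64, 4, 1, favorite_prob, rng)
    capacity, load, dropped = dispatch(routes, 4, 1, capacity_factor=1.25)
    print(f"{label}: capacity={capacity} load={load} dropped={dropped}")
```

Dropping in arrival order is one simple policy chosen for this sketch; real systems may prioritize tokens differently.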

Router mode (selector)
Routing stream: each token is routed to its top-$K$ experts. Lines fade in as routing decisions are made.
Per-expert load: the red marker shows capacity $= \lceil \text{capacity\_factor} \times T \cdot K / N \rceil$, where $T$ is the number of tokens in the batch.
Counters: Routed, Dropped, Imbalance.
Imbalance = coefficient of variation (std / mean) of the router's pre-capacity routing decisions. 0 is perfect. Computing it before capacity matters because capping by capacity hides the bias — the Switch Transformer aux loss penalizes the pre-capacity distribution directly.
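Both quantities in that note are short to compute. The sketch below assumes population standard deviation for the coefficient of variation (the visualizer may use the sample version) and writes the auxiliary loss in the Switch Transformer form $\alpha \cdot N \cdot \sum_i f_i \cdot P_i$, where $f_i$ is the fraction of tokens routed to expert $i$ and $P_i$ is the batch-averaged router probability for expert $i$. The function names and the default `alpha=0.01` are illustrative.

```python
from statistics import mean, pstdev


def imbalance(counts):
    """Coefficient of variation of per-expert routing counts (0 = even)."""
    avg = mean(counts)
    return pstdev(counts) / avg if avg else 0.0


def switch_aux_loss(token_fractions, mean_router_probs, alpha=0.01):
    """alpha * N * sum_i f_i * P_i, in the Switch Transformer form.

    token_fractions[i]: share of tokens whose top choice is expert i.
    mean_router_probs[i]: router probability for expert i, batch-averaged.
    """
    n = len(token_fractions)
    return alpha * n * sum(f * p for f, p in zip(token_fractions, mean_router_probs))


# Pre-capacity routing counts for two made-up batches of 64 tokens.
balanced = [16, 16, 16, 16]
collapsed = [61, 1, 1, 1]
print(f"balanced imbalance:  {imbalance(balanced):.2f}")
print(f"collapsed imbalance: {imbalance(collapsed):.2f}")

uniform = [0.25, 0.25, 0.25, 0.25]
one_hot = [1.0, 0.0, 0.0, 0.0]
print("aux loss, uniform routing:", switch_aux_loss(uniform, uniform, alpha=1.0))
print("aux loss, full collapse:  ", switch_aux_loss(one_hot, one_hot, alpha=1.0))
```

With `alpha=1.0` the loss equals 1 under perfectly uniform routing and grows toward $N$ as routing collapses onto one expert, which is why minimizing it pushes the router toward balance.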
Try this
  • Switch to Collapsed: one expert eats almost every token; capacity is hit fast and the rest are dropped.
  • Switch to Biased: a single expert is over-preferred but others still see traffic.
  • Lower the capacity factor: even a balanced router drops a few tokens because of variance — the cost of guaranteed throughput.
  • Raise $K$ from 1→2: redundant routing absorbs imbalance at the cost of more compute per token.