From a dense FFN to a sparse mixture
A transformer's feed-forward (FFN) block typically accounts for most of the parameters and per-token compute in each layer. A mixture-of-experts (MoE) layer replaces it with $N$ expert FFNs plus a router, and activates only $K$ of them for each token: per-token compute scales with $K$, while the model's total parameter count grows with $N$. The router is a tiny linear layer; the experts are full-size FFNs.
Mixtral 8×7B uses $N=8, K=2$: ~47B total params but only ~13B activated per token. Param counts here use the SwiGLU-style estimate $3 \cdot d \cdot h_{ff}$ per FFN.
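The arithmetic behind those figures can be sketched as follows. The dimensions below ($d = 4096$, $h_{ff} = 14336$, 32 layers) are assumptions taken from the publicly released Mixtral configuration, not from the text above, and the sketch counts expert FFN weights only.

```python
# Assumed Mixtral-like dimensions (not stated in the text above).
d_model = 4096      # hidden size d
d_ff = 14336        # FFN inner size h_ff
n_layers = 32
n_experts = 8       # N
top_k = 2           # K

# SwiGLU-style FFN: three d x h_ff weight matrices (gate, up, down).
ffn_params = 3 * d_model * d_ff

total_expert_params = n_layers * n_experts * ffn_params   # stored in memory
active_expert_params = n_layers * top_k * ffn_params      # used per token

print(f"per expert FFN: {ffn_params / 1e6:.1f}M")
print(f"all experts:    {total_expert_params / 1e9:.1f}B")
print(f"active experts: {active_expert_params / 1e9:.1f}B")
```

Under these assumptions the experts alone account for roughly 45B stored and 11B active parameters. The rest of the ~47B and ~13B totals comes from attention, embeddings, and the routers, which every token uses regardless of routing.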
How the router decides
The router is a single linear layer $W_r \in \mathbb{R}^{N \times d}$. For a token $x$ it produces $N$ logits, softmaxes them, picks the top-$K$, and renormalizes the kept weights so they sum to 1. Step through it below.
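The four steps (logits, softmax, top-$K$, renormalize) can be sketched in NumPy for a single token. The function name `route` and the sizes used here are illustrative choices, not part of any particular library.

```python
import numpy as np

def route(x, W_r, k):
    """Return (expert indices, renormalized weights) for one token x."""
    logits = W_r @ x                          # N logits, one per expert
    logits = logits - logits.max()            # subtract max for a stable softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    top = np.argsort(probs)[::-1][:k]         # indices of the K largest probs
    weights = probs[top] / probs[top].sum()   # renormalize kept weights to sum to 1
    return top, weights

rng = np.random.default_rng(0)
N, d, K = 8, 16, 2                            # illustrative sizes only
W_r = rng.normal(size=(N, d))                 # router weights, shape (N, d)
x = rng.normal(size=d)                        # one token's hidden state

experts, weights = route(x, W_r, K)
print("chosen experts:", experts, "weights:", weights)
```

The layer output would then be the weighted sum of the chosen experts' outputs, using `weights` as the mixing coefficients.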
Why renormalize?
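After top-$K$ selection, the kept softmax probabilities no longer sum to 1. If we used them directly, the router would also be implicitly down-weighting the combined expert output. Renormalizing means "given the experts I picked, here is the relative confidence between them", and it makes the layer output a convex combination of expert outputs, so its scale is comparable to that of a single dense FFN. Mixtral renormalizes (it applies the softmax over the top-$K$ logits, which is equivalent). Switch Transformer, which routes with $K=1$, does not: it scales the chosen expert's output by the raw router probability, because a renormalized top-1 weight would always equal 1 and pass no gradient to the router.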
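To make the renormalization concrete, write $z_i$ for the router logits, $p_i$ for the full softmax probabilities, and $T$ for the set of selected top-$K$ experts (notation introduced here for illustration). The renormalized weight $g_i$ for a selected expert is:

$$
g_i \;=\; \frac{p_i}{\sum_{j \in T} p_j}
\;=\; \frac{e^{z_i} \big/ \sum_{m=1}^{N} e^{z_m}}{\sum_{j \in T} e^{z_j} \big/ \sum_{m=1}^{N} e^{z_m}}
\;=\; \frac{e^{z_i}}{\sum_{j \in T} e^{z_j}}, \qquad i \in T.
$$

The full-softmax denominator cancels, so renormalizing is the same as taking a softmax over only the selected logits. As a small worked case, if the two kept probabilities are $0.5$ and $0.3$, the renormalized weights are $0.5/0.8 = 0.625$ and $0.3/0.8 = 0.375$. Without renormalization, the combined output would be scaled by $0.8$.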
Why load balancing matters
Stream a batch of tokens through the router and watch the per-expert load fill up. A well-trained router spreads tokens roughly evenly. An untrained or collapsed router sends almost everything to one expert, so you pay for $N$ experts but get the throughput of one. With a finite capacity factor, each expert accepts only a fixed number of tokens per batch, and tokens that overflow their expert are dropped.
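A minimal simulation of this behavior is sketched below. It assumes one common capacity rule, $\lceil \text{CF} \cdot \text{tokens} \cdot K / N \rceil$ slots per expert, and first-come-first-served filling. The function name `simulate_load` and the logit bias used to mimic a collapsed router are hypothetical choices for illustration.

```python
import math
import numpy as np

def simulate_load(logits, k, capacity_factor):
    """Route tokens top-k; return (per-expert load, dropped assignments, capacity).

    logits: array of shape (tokens, N) holding router logits.
    """
    n_tokens, n_experts = logits.shape
    # Assumed capacity rule: an even share of all assignments, scaled by CF.
    capacity = math.ceil(capacity_factor * n_tokens * k / n_experts)
    top = np.argsort(logits, axis=1)[:, ::-1][:, :k]   # (tokens, k) expert ids
    load = np.zeros(n_experts, dtype=int)
    dropped = 0
    for token_experts in top:          # tokens are processed in arrival order
        for e in token_experts:
            if load[e] < capacity:
                load[e] += 1
            else:
                dropped += 1           # expert is full: this assignment is dropped
    return load, dropped, capacity

rng = np.random.default_rng(0)
tokens, N = 512, 8
balanced = rng.normal(size=(tokens, N))  # no expert is systematically preferred
collapsed = balanced.copy()
collapsed[:, 0] += 10.0                  # hypothetical bias: expert 0 nearly always wins

for name, logits in [("balanced", balanced), ("collapsed", collapsed)]:
    load, dropped, cap = simulate_load(logits, k=1, capacity_factor=1.25)
    print(name, "capacity", cap, "load", load.tolist(), "dropped", dropped)
```

With these settings each expert has 80 slots. The collapsed router fills expert 0 and drops most of the remaining tokens, while the balanced router drops few or none. Lowering `capacity_factor` or changing `k` lets you explore the same trade-offs numerically.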
Try this
- Switch to Collapsed: one expert eats almost every token; capacity is hit fast and the rest are dropped.
- Switch to Biased: a single expert is over-preferred but others still see traffic.
- Lower the capacity factor: even a balanced router drops a few tokens because of variance — the cost of guaranteed throughput.
- Raise $K$ from 1→2: redundant routing absorbs imbalance at the cost of more compute per token.