Mixture of Experts (MoE) Routing Visualizer

Section 1 · Why MoE

From a dense FFN to a sparse mixture

In a transformer, the feed-forward block holds most of each layer's parameters and, at moderate sequence lengths, accounts for most of the compute per token. MoE keeps per-token work small by activating only $K$ of $N$ experts, while the model's total parameter count grows with $N$. The router is a tiny linear layer; each expert is a full-size FFN.

Dense FFN

[Diagram: token $x$ → FFN ($h_{ff}$ hidden, $\approx 2 d \cdot h_{ff}$ params for a two-matrix FFN) → output $y$]

Every token passes through the same FFN. Activated params = total params.

MoE Layer

[Diagram: token $x$ → Router ($W_r \in \mathbb{R}^{N\times d}$, top-$K$) → experts $E_1$, $E_2$, $E_3$, $E_4$ → output $y$; solid edge = chosen by top-$K$]

$y = \sum_{i \in \text{top-}K} w_i \cdot E_i(x)$

Each token activates only $K$ of the $N$ experts. Activated expert params $\approx \tfrac{K}{N} \cdot$ total expert params.

Compute calculator

Compare an MoE layer to its dense baseline at the same per-expert hidden size.

Experts $N$: 8

Top-$K$: 2

Model dim $d$: 4096

Hidden $h_{ff}$: 14336

Dense FFN total params

—

MoE total params

—

Activated per token

—

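Mixtral 8×7B uses $N=8, K=2$: ~47B total params but only ~13B activated per token. The activated share is a little above $K/N = 1/4$ because attention and embedding parameters are shared and run for every token. Param counts here use the SwiGLU-style estimate $3 \cdot d \cdot h_{ff}$ per FFN.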
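The calculator's arithmetic can be reproduced in a few lines. This is a minimal sketch that assumes the $3 \cdot d \cdot h_{ff}$ estimate above and ignores biases; the function names are hypothetical, and reporting the router's $N \cdot d$ weights separately is a choice made here, not necessarily what the page's calculator does.

```python
def ffn_params(d: int, h_ff: int) -> int:
    """SwiGLU-style estimate: gate, up and down projections, no biases."""
    return 3 * d * h_ff


def moe_layer_params(n_experts: int, top_k: int, d: int, h_ff: int) -> dict:
    """Parameter counts for one MoE layer versus its dense baseline."""
    per_expert = ffn_params(d, h_ff)
    return {
        "dense_total": per_expert,             # one FFN, always active
        "moe_total": n_experts * per_expert,   # all experts stored
        "activated": top_k * per_expert,       # experts run per token
        "router": n_experts * d,               # W_r has shape (N, d)
    }


# Default values from the calculator: N = 8, K = 2, d = 4096, h_ff = 14336.
counts = moe_layer_params(n_experts=8, top_k=2, d=4096, h_ff=14336)
for name, value in counts.items():
    print(f"{name:>12}: {value:,}")
```

With these defaults the router contributes 32,768 weights against roughly 176 million per expert, which is why it is usually left out of the $K/N$ approximation.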

Section 2 · The router, one token at a time

How the router decides

The router is a single linear layer $W_r \in \mathbb{R}^{N \times d}$. For a token $x$ it produces $N$ logits, softmaxes them, picks the top-$K$, and renormalizes the kept weights so they sum to 1. Step through it below.

Step 0 / 5

$N$ = 6, $K$ = 2

0 · input → 1 · $W_r x$ → 2 · softmax → 3 · top-$K$ → 4 · renormalize → 5 · weighted sum

Token vector $x$

token #—

Router weights $W_r$ (preview)

$N \times d$ shown; cell shade = magnitude

step: input

A new token vector is sampled. It is the input to the router.

Logits / probabilities / weights

Experts

Output

$y = $ — (advance to step 5)
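The five steps of the walkthrough can be sketched in plain Python. This is an illustrative toy, not the visualizer's own code: the names `route` and `moe_forward` are hypothetical, the router weights are random, and each "expert" is a stand-in that simply scales its input instead of running a real FFN.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(x, w_r, k):
    """Steps 1-4. Returns {expert index: renormalized weight}."""
    # Step 1: logits = W_r x (one dot product per expert row).
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in w_r]
    # Step 2: softmax over all N experts.
    probs = softmax(logits)
    # Step 3: keep the K experts with the highest probability.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Step 4: renormalize the kept probabilities so they sum to 1.
    kept = sum(probs[i] for i in top)
    return {i: probs[i] / kept for i in top}


def moe_forward(x, w_r, experts, k):
    """Step 5: weighted sum of the chosen experts' outputs."""
    weights = route(x, w_r, k)
    y = [0.0] * len(x)
    for i, w in weights.items():
        out = experts[i](x)
        y = [yj + w * oj for yj, oj in zip(y, out)]
    return y, weights


# Toy setup: N = 6 experts, K = 2, d = 4. Expert i multiplies x by (i + 1).
rng = random.Random(0)
N, K, d = 6, 2, 4
w_r = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(N)]
experts = [lambda v, s=i + 1: [s * vj for vj in v] for i in range(N)]
x = [rng.gauss(0, 1) for _ in range(d)]

y, weights = moe_forward(x, w_r, experts, K)
print("chosen experts and weights:", weights)
print("y =", y)
```

Only the $K$ chosen experts are ever called, which is where the compute saving comes from; the other $N - K$ experts contribute nothing to this token.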

Why renormalize?

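After top-$K$ selection, the kept softmax probabilities no longer sum to 1. Used directly, they would scale the whole layer output down by the total kept probability, on top of setting the experts' relative weights. Renormalizing means "given the experts I picked, here is the relative confidence between them", and it makes the output a convex combination of the chosen experts' outputs, so its scale stays comparable to a single FFN's. Mixtral renormalizes (it applies the softmax over only the top-$K$ logits, which is equivalent). Switch Transformer, which uses $K=1$, does not: it scales the chosen expert's output by its raw router probability, because a renormalized single weight would always be 1 and the router would receive no gradient.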
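As a worked example with hypothetical numbers (not taken from the visualizer), suppose $N=4$, $K=2$, and the softmax gives $p = (0.50, 0.30, 0.15, 0.05)$. The top-2 experts are $E_1$ and $E_2$:

$$
\begin{aligned}
\text{kept mass} &= 0.50 + 0.30 = 0.80 \\
w_1 &= 0.50 / 0.80 = 0.625, \qquad w_2 = 0.30 / 0.80 = 0.375 \\
\text{without renormalizing:}\quad y &= 0.50\,E_1(x) + 0.30\,E_2(x) \\
\text{with renormalizing:}\quad y &= 0.625\,E_1(x) + 0.375\,E_2(x)
\end{aligned}
$$

The ratio between the two experts is unchanged (5 : 3); only the total weight is restored from 0.80 to 1.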

Section 3 · Batch routing & load balance

Why load balancing matters

Stream a batch of tokens through the router and watch the per-expert load fill up. A well-balanced router spreads tokens roughly evenly across experts. A collapsed router sends almost everything to one expert, so you pay for $N$ experts but get the throughput of one. With a finite capacity factor, each expert accepts only a fixed number of tokens per batch, and tokens that overflow their expert are dropped.

Tokens $T$ = 24, $N$ = 6, $K$ = 2, capacity factor = 1.25

Router mode:

Routing stream

Each token is routed to its top-$K$ experts. Lines fade in as routing decisions are made.

Per-expert load

red marker = capacity = ⌈capacity_factor × T·K / N⌉

Routed

Dropped

Imbalance

0.00

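Imbalance = coefficient of variation (std / mean) of the per-expert token counts implied by the router's pre-capacity routing decisions. 0 means perfectly even load. It is computed before capacity is applied because capping hides the bias: a collapsed router's overflow is simply dropped, which makes the surviving load look flatter than what the router asked for. The Switch Transformer auxiliary loss likewise penalizes the pre-capacity distribution.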
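The capacity rule and the imbalance metric can be sketched as follows. Several details are assumptions made for illustration: the function names are hypothetical, the standard deviation is the population form (`statistics.pstdev`), tokens are processed in arrival order, and a drop is counted per (token, expert) slot. The page may differ on any of these.

```python
import math
from statistics import mean, pstdev


def expert_capacity(num_tokens, n_experts, top_k, capacity_factor):
    """Per-expert capacity: ceil(capacity_factor * T * K / N)."""
    return math.ceil(capacity_factor * num_tokens * top_k / n_experts)


def load_report(assignments, n_experts, capacity):
    """assignments: one list of chosen expert ids per token, in arrival order."""
    requested = [0] * n_experts  # pre-capacity: what the router asked for
    accepted = [0] * n_experts   # post-capacity: what actually runs
    dropped = 0
    for chosen in assignments:
        for e in chosen:
            requested[e] += 1
            if accepted[e] < capacity:
                accepted[e] += 1
            else:
                dropped += 1     # expert is full, this slot is skipped
    imbalance = pstdev(requested) / mean(requested)  # CV of pre-capacity load
    return {"requested": requested, "accepted": accepted,
            "dropped": dropped, "imbalance": imbalance}


T, N, K = 24, 6, 2
cap = expert_capacity(T, N, K, capacity_factor=1.25)

# Hand-built routing patterns standing in for the router modes.
balanced = [[t % N, (t + 1) % N] for t in range(T)]       # round-robin
collapsed = [[0, 1 + t % (N - 1)] for t in range(T)]      # expert 0 always chosen

bal = load_report(balanced, N, cap)
col = load_report(collapsed, N, cap)
print("capacity per expert:", cap)
print("balanced :", bal)
print("collapsed:", col)
```

In the collapsed case the `accepted` counts are much flatter than the `requested` counts, which is the effect the paragraph above describes: measuring after capping would understate how skewed the router is.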

Try this

Switch to Collapsed: one expert eats almost every token; capacity is hit fast and the rest are dropped.
Switch to Biased: a single expert is over-preferred but others still see traffic.
Lower the capacity factor: even a balanced router drops a few tokens because of variance — the cost of guaranteed throughput.
Raise $K$ from 1→2: redundant routing absorbs imbalance at the cost of more compute per token.