Mixture of Experts (MoE) — Interactive Visualizer

Core Idea

Dense FFN

Traditional Transformer FFN

Every token passes through one large FFN. All parameters active on all tokens. Compute scales linearly with model size.

Sparse MoE

Mixture of Experts FFN

The FFN is split into N expert sub-networks. Each token activates only its top-k experts. Total parameters grow with N, but FLOPs/token stay roughly constant because only k experts run.

Router

Gating Network

A learned linear router maps token embeddings → expert scores. Softmax + top-k selection decides which experts process each token.

Key Insight

Parameter Efficiency

Mixtral 8×7B: 46.7B total params, only 12.9B active per token, yet it matches or beats Llama-2 70B on most benchmarks with roughly 5× fewer active parameters (Mistral reports about 6× faster inference).
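
The 46.7B / 12.9B split can be reproduced with a short parameter count. This is a sketch: the hyperparameters below are assumptions taken from Mixtral's published configuration, not from this page, and `count_params` is a hypothetical helper name.

```python
# Assumed Mixtral 8x7B hyperparameters (published config, not stated on this page).
D_MODEL = 4096
FFN_DIM = 14336
N_LAYERS = 32
N_EXPERTS = 8
TOP_K = 2
N_HEADS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
VOCAB = 32000


def count_params(experts_counted: int) -> int:
    """Count weights when `experts_counted` experts per layer are included."""
    expert = 3 * D_MODEL * FFN_DIM  # SwiGLU expert: gate, up, and down projections
    # Query/output projections use all heads; key/value use the grouped KV heads.
    attn = 2 * D_MODEL * N_HEADS * HEAD_DIM + 2 * D_MODEL * N_KV_HEADS * HEAD_DIM
    router = D_MODEL * N_EXPERTS    # the gating matrix W_g
    norms = 2 * D_MODEL             # two RMSNorm weight vectors per layer
    per_layer = experts_counted * expert + attn + router + norms
    embeddings = 2 * VOCAB * D_MODEL  # input embedding plus untied output head
    final_norm = D_MODEL
    return N_LAYERS * per_layer + embeddings + final_norm


total = count_params(N_EXPERTS)  # every expert is stored in memory
active = count_params(TOP_K)     # only top-k experts run for a given token
print(f"total  = {total / 1e9:.1f}B")   # total  = 46.7B
print(f"active = {active / 1e9:.1f}B")  # active = 12.9B
```

Almost all of the gap comes from the expert FFNs: attention, embeddings, and the router are shared by every token.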

Architecture — Click "Animate Token" to watch a forward pass

Experts (N) 8

Top-k 2


Gating Mechanism — Step by Step

Experts (N) 6

Top-k 2

Temperature τ 1.0


Router Math

Step 1 — Linear Projection

g(x) = W_g · x

Token embedding x is projected to N logits, one per expert. W_g is small: N × d_model.

Step 2 — Softmax

p = Softmax(g(x) / τ)

Temperature τ controls sharpness. Low τ → near one-hot. High τ → near uniform. Training typically uses τ = 1.
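
A minimal sketch of the temperature effect, using made-up logits for three experts (the values are illustrative only):

```python
import math


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical router logits
for tau in (0.1, 1.0, 10.0):
    probs = softmax(logits, tau)
    print(f"tau={tau}: {[round(p, 3) for p in probs]}")
```

At τ = 0.1 nearly all probability mass lands on the first expert; at τ = 10 the three probabilities are almost equal.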

Step 3 — Top-k Select

S = TopK(p, k)

Pick the k highest-probability experts. Only these run a forward pass; all others are skipped entirely.

Step 4 — Weighted Sum

y = Σᵢ∈S p̃ᵢ · Eᵢ(x)

Selected expert outputs combined by renormalized routing weights p̃ᵢ = pᵢ / Σⱼ∈S pⱼ.
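
Steps 1 to 4 can be put together in a few lines. This is a pure-Python sketch for a single token, not an efficient implementation; `moe_forward`, `make_scaling_expert`, and the toy weights are hypothetical names and values chosen for illustration.

```python
import math


def softmax(logits, tau=1.0):
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, W_g, experts, k=2, tau=1.0):
    """Route one token x through the top-k of `experts`."""
    # Step 1: linear projection, one logit per expert (W_g is N x d_model).
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W_g]
    # Step 2: temperature softmax over experts.
    p = softmax(logits, tau)
    # Step 3: indices of the k most probable experts.
    top = sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]
    # Step 4: renormalize over the selected set and mix the expert outputs.
    norm = sum(p[i] for i in top)
    weights = {i: p[i] / norm for i in top}
    y = [0.0] * len(x)
    for i, w in weights.items():
        out = experts[i](x)  # only selected experts are evaluated
        y = [yj + w * oj for yj, oj in zip(y, out)]
    return y, weights


def make_scaling_expert(scale):
    """Toy stand-in for an expert FFN: multiplies the input by a constant."""
    def expert(x):
        return [scale * v for v in x]
    return expert


experts = [make_scaling_expert(s) for s in (1.0, 10.0, 100.0)]
W_g = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # 3 experts, d_model = 2
x = [2.0, 1.0]
y, weights = moe_forward(x, W_g, experts, k=2)
print(weights)
print(y)
```

With these toy values the logits are [2, 1, -3], so experts 0 and 1 are selected and the third expert is never called.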

Load Balancing — The Core Training Challenge

Experts (N) 8

Tokens/batch 64

Collapse bias 5


Why Balance Matters

Expert Collapse

Without regularization the router tends to converge on the same 1–2 experts. With N = 8, the other 6 experts then receive ~0 tokens, so their parameters are wasted and barely trained.
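
A toy simulation of collapse, assuming router logits are random Gaussian noise plus a fixed bias on expert 0 and that routing is top-1. Both are simplifying assumptions for illustration; `top1_shares` is a hypothetical helper.

```python
import random


def top1_shares(n_experts=8, n_tokens=64, bias=0.0, seed=0):
    """Fraction of tokens won by each expert under top-1 routing."""
    rng = random.Random(seed)
    counts = [0] * n_experts
    for _ in range(n_tokens):
        logits = [rng.gauss(0.0, 1.0) for _ in range(n_experts)]
        logits[0] += bias  # a router that systematically favors expert 0
        winner = max(range(n_experts), key=lambda i: logits[i])
        counts[winner] += 1
    return [c / n_tokens for c in counts]


print("no bias:", top1_shares(bias=0.0))
print("bias 5 :", top1_shares(bias=5.0))
```

With a bias of 5 standard deviations, expert 0 wins almost every token and the remaining experts are starved.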

Auxiliary Loss (Switch)

L_aux = α·N · Σᵢ fᵢ·Pᵢ

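fᵢ = fraction of tokens routed to expert i. Pᵢ = mean router probability assigned to expert i over the batch. The loss is smallest when routing is uniform; gradients flow through Pᵢ, since fᵢ is a non-differentiable count.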
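
A small sketch of the auxiliary loss for top-1 routing, computed from a batch of router probabilities. The value α = 0.01 and the two example batches are assumptions for illustration.

```python
def switch_aux_loss(probs, alpha=0.01):
    """L_aux = alpha * N * sum_i f_i * P_i for top-1 routing.

    probs: one row per token, each row a probability distribution over experts.
    """
    n_tokens = len(probs)
    n_experts = len(probs[0])
    # f_i: fraction of tokens whose argmax expert is i.
    f = [0.0] * n_experts
    for row in probs:
        f[max(range(n_experts), key=lambda i: row[i])] += 1.0 / n_tokens
    # P_i: mean router probability for expert i across the batch.
    P = [sum(row[i] for row in probs) / n_tokens for i in range(n_experts)]
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, P))


balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4  # every token prefers expert 0

print(switch_aux_loss(balanced))   # about 0.010 (equals alpha when balanced)
print(switch_aux_loss(collapsed))  # about 0.028
```

The balanced batch gives exactly α, the minimum; the collapsed batch is penalized in proportion to how much probability the overloaded expert receives.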

Expert Capacity

Buffer size: capacity = ⌊tokens/N⌋ × C, where C is the capacity factor. Tokens beyond an expert's capacity are dropped and pass through the residual connection only. C = 1.0–1.25 is typical.
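
The capacity rule and the resulting token dropping can be sketched as follows. Truncating the capacity to an integer and filling buffers in token order are assumptions of this sketch; `expert_capacity` and `dispatch` are hypothetical helpers.

```python
import math


def expert_capacity(n_tokens, n_experts, capacity_factor):
    """capacity = floor(tokens / N) * C, truncated to a whole number of slots."""
    return int(math.floor(n_tokens / n_experts) * capacity_factor)


def dispatch(assignments, n_experts, capacity):
    """Fill per-expert buffers in token order; overflow tokens are dropped."""
    buffers = [[] for _ in range(n_experts)]
    dropped = []
    for token_id, expert in enumerate(assignments):
        if len(buffers[expert]) < capacity:
            buffers[expert].append(token_id)
        else:
            dropped.append(token_id)  # this token only takes the residual path
    return buffers, dropped


cap = expert_capacity(n_tokens=64, n_experts=8, capacity_factor=1.25)
print(cap)  # 10

# Toy batch: 12 tokens want expert 0, 4 tokens want expert 1.
assignments = [0] * 12 + [1] * 4
buffers, dropped = dispatch(assignments, n_experts=8, capacity=cap)
print(dropped)  # [10, 11]
```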

Real-World Variants

Mixtral: top-2, no dropping. Switch: top-1, capacity factor 1.25. DeepSeek-V3: 256 routed experts, top-8 fine-grained routing.

Hardware Implications — Why MoE is Hard to Deploy

Calculator outputs (computed from the inputs below): Total Params, Active Params/Token, Sparsity Ratio, Memory (bf16), FLOPs/Token.

Experts (N) 8

Top-k 2

Expert FFN dim 4096

All-to-All Communication Pattern

Expert Parallelism

Experts are spread across D GPUs, each holding N/D experts. Tokens must be dispatched to the device that hosts their expert and the outputs sent back, which takes two All-to-All collectives (dispatch and combine) per MoE layer.
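
A sketch of the bookkeeping one device does before the dispatch All-to-All: grouping its tokens by destination device. It assumes experts are placed contiguously (experts 0..N/D−1 on device 0, and so on) and that N is divisible by D; `build_send_buckets` is a hypothetical helper, not a library call.

```python
def build_send_buckets(assignments, n_experts, n_devices):
    """Group (token_id, expert_id) pairs by the device that hosts the expert."""
    experts_per_device = n_experts // n_devices
    buckets = [[] for _ in range(n_devices)]
    for token_id, expert in enumerate(assignments):
        device = expert // experts_per_device  # contiguous placement assumption
        buckets[device].append((token_id, expert))
    return buckets


# Four local tokens routed to experts 0, 5, 7, and 2 (N = 8 experts, D = 4 devices).
buckets = build_send_buckets([0, 5, 7, 2], n_experts=8, n_devices=4)
for device, bucket in enumerate(buckets):
    print(device, bucket)
```

Each bucket becomes one segment of the All-to-All send buffer; the combine step reverses the same mapping.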

Communication Bottleneck

Each All-to-All moves O(B × d_model) activation values per device, where B is the number of tokens held by that device. Unless it is explicitly overlapped with compute, it acts as a hard synchronization barrier around every MoE layer.
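
A back-of-the-envelope estimate of that traffic, under stated assumptions: routing is uniform across devices, each of the top-k copies of a token is sent separately, activations are bf16 (2 bytes), and both the dispatch and the combine (return) transfer are counted. `all_to_all_bytes` and the example sizes are hypothetical.

```python
def all_to_all_bytes(tokens_per_device, d_model, top_k, n_devices, bytes_per_elem=2):
    """Approximate bytes one device sends per MoE layer (dispatch + combine)."""
    routed = tokens_per_device * top_k * d_model * bytes_per_elem
    # Under uniform routing, a (D - 1) / D share of copies leaves the device.
    off_device = routed * (n_devices - 1) // n_devices
    return 2 * off_device  # once to dispatch, once to return expert outputs


volume = all_to_all_bytes(tokens_per_device=4096, d_model=4096, top_k=2, n_devices=8)
print(f"{volume / 2**20:.0f} MiB per device per MoE layer")  # 112 MiB
```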

Memory vs Compute

Dense 70B: 140 GB of bf16 weights, high FLOPs. MoE 8×7B (46.7B total): ~94 GB, low FLOPs. You pay in memory, not arithmetic, and inference is often memory-bandwidth bound on A100/H100.
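
The arithmetic behind these figures, as a sketch. It assumes 2 bytes per parameter (bf16), counts weights only (no KV cache or activations), and uses the common rule of thumb of about 2 forward-pass FLOPs per active parameter per token.

```python
def weight_memory_gb(total_params, bytes_per_param=2):
    """Weight memory in GB; every expert must be resident, so use total params."""
    return total_params * bytes_per_param / 1e9


def flops_per_token(active_params):
    """Rule of thumb: ~2 FLOPs (multiply + add) per active parameter."""
    return 2 * active_params


dense_total = dense_active = 70e9
moe_total, moe_active = 46.7e9, 12.9e9  # Mixtral 8x7B figures quoted on this page

print(f"dense: {weight_memory_gb(dense_total):.1f} GB, "
      f"{flops_per_token(dense_active) / 1e9:.1f} GFLOPs/token")
print(f"MoE:   {weight_memory_gb(moe_total):.1f} GB, "
      f"{flops_per_token(moe_active) / 1e9:.1f} GFLOPs/token")
```

Memory follows total parameters while arithmetic follows active parameters, which is why the MoE model needs about two thirds of the dense model's memory for under a fifth of its FLOPs.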

Token Dropping

When an expert buffer overflows, the excess tokens skip expert computation entirely (residual stream passthrough). This introduces an approximation but keeps step time bounded, which protects the throughput SLA.

Dense vs Sparse MoE — Tradeoffs

Dimension	Dense Transformer	Sparse MoE	Verdict
Parameter count	70B	8×13B = 104B total, 13B active	MoE: more capacity
FLOPs per token	~140 GFLOPs (≈ 2 × 70B params)	~26 GFLOPs (≈ 2 × 13B active params)	MoE: ~5× cheaper
Memory required	140 GB (bf16)	208 GB (bf16, all experts resident)	Dense: fits on fewer GPUs
Inference throughput	~1× baseline	~4–6× baseline	MoE: less compute/token
Training stability	High	Moderate (collapse risk)	Dense
Communication cost	Low	High (All-to-All per layer)	Dense
Quality at same FLOPs	Baseline	+20–40% better loss	MoE
Serving complexity	Simple	Complex (capacity, routing)	Dense
Expert specialization	N/A	Emergent domain experts	MoE: interpretability

Notable MoE Models

2017

Sparsely-Gated MoE

Shazeer et al.: first large-scale MoE layer in language models, up to 137B params with noisy top-k gating. Showed that sparse routing can match dense quality at a fraction of the FLOPs.

2021

Switch Transformer

Google: top-1 routing, up to 1.6T params. Showed that MoE training can be stabilized at trillion-parameter scale with a simplified auxiliary balancing loss.

2023

Mixtral 8×7B

Mistral AI: open-weight, 46.7B total / 12.9B active. Matches or beats Llama-2 70B on most benchmarks with roughly 5× fewer active parameters (Mistral reports about 6× faster inference).

2024

DeepSeek-V3 / MoE

256 routed experts, top-8 fine-grained routing (671B total, 37B active per token). Auxiliary-loss-free load balancing. Among the strongest open-weight models as of late 2024.

Quality vs Compute Budget (Scaling Law Explorer)

[Interactive chart: model quality vs. compute budget (TFLOPs/token, slider default 20). Series: dense model and sparse MoE; a marker shows the selected compute budget.]