Core Idea
Traditional Transformer FFN
Every token passes through one large FFN. All parameters active on all tokens. Compute scales linearly with model size.
Mixture of Experts FFN
The FFN is split into N expert sub-networks. Each token activates only its top-k experts. Total parameters grow with N, but FLOPs per token stay roughly constant for a fixed k.
Gating Network
A learned linear router maps token embeddings → expert scores. Softmax + top-k selection decides which experts process each token.
Parameter Efficiency
Mixtral 8×7B: 46.7B total params, only 12.9B active per token, yet it matches or beats Llama-2 70B on most benchmarks with roughly 5× fewer active parameters (Mistral reports about 6× faster inference).
Architecture
Gating Mechanism — Step by Step
Router Math
Step 1 — Linear Projection
g(x) = W_g · x
The token embedding x is projected to N logits, one per expert. W_g is small: N × d_model.
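To make the projection concrete, here is a minimal sketch in plain Python. The helper name `router_logits` and the toy values for `x` and `w_g` are assumptions chosen for illustration; a real router would use a tensor library and learned weights.

```python
# Router projection g(x) = W_g · x in plain Python (no external libraries).
# w_g is stored as N rows of d_model weights, so the result has N logits.

def router_logits(w_g, x):
    """Return one logit per expert: the dot product of each row of w_g with x."""
    if any(len(row) != len(x) for row in w_g):
        raise ValueError("each row of w_g must have d_model entries")
    return [sum(w * xi for w, xi in zip(row, x)) for row in w_g]

# Toy values (assumed for illustration): d_model = 3, N = 4 experts.
x = [1.0, 2.0, -1.0]
w_g = [
    [0.5, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
logits = router_logits(w_g, x)
print(logits)  # [0.5, 2.0, -1.0, 2.0]
```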
Step 2 — Softmax
p = Softmax(g(x) / τ)
Temperature τ controls sharpness. Low τ → near one-hot. High τ → near uniform. Training typically uses τ = 1.
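The effect of τ is easy to see numerically. The sketch below is a plain-Python softmax with a temperature argument; the function name `softmax_with_temperature` and the example logits are assumptions made for illustration.

```python
import math

def softmax_with_temperature(logits, tau=1.0):
    """Softmax over logits / tau, with the max subtracted for numerical stability."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.5, 2.0, -1.0, 1.0]  # toy router logits (assumed values)
p_sharp = softmax_with_temperature(logits, tau=0.1)    # close to one-hot on expert 1
p_train = softmax_with_temperature(logits, tau=1.0)    # the usual training setting
p_flat = softmax_with_temperature(logits, tau=100.0)   # close to uniform

for name, p in [("tau=0.1", p_sharp), ("tau=1", p_train), ("tau=100", p_flat)]:
    print(name, [round(v, 3) for v in p])
```

Lowering τ concentrates probability on the largest logit, while raising it pushes every expert toward 1/N.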
Step 3 — Top-k Select
S = TopK(p, k)
Pick k highest-probability experts. Only these run forward pass. All others are skipped entirely.
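Top-k selection can be sketched in a few lines. The helper `top_k_indices` is a hypothetical name, and breaking ties in favor of the lower expert index is an assumption of this sketch, not something the text above specifies.

```python
def top_k_indices(probs, k):
    """Return the indices of the k largest probabilities, in ascending index order.

    Ties are broken in favor of the lower index (an assumption of this sketch),
    which follows from Python's stable sort.
    """
    if not 1 <= k <= len(probs):
        raise ValueError("k must be between 1 and the number of experts")
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sorted(ranked[:k])

probs = [0.1, 0.4, 0.2, 0.3]  # toy routing probabilities for 4 experts
selected = top_k_indices(probs, k=2)
print(selected)  # [1, 3]
```

Only the experts in `selected` would run their forward pass for this token.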
Step 4 — Weighted Sum
y = Σᵢ∈S p̃ᵢ · Eᵢ(x)
Selected expert outputs combined by renormalized routing weights p̃ᵢ = pᵢ / Σⱼ∈S pⱼ.
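The weighted sum with renormalized weights might look like the sketch below. The function `moe_combine` and the toy experts (each just scales its input) are assumptions for illustration; real experts are FFNs whose output has the same dimension as x.

```python
def moe_combine(x, probs, selected, experts):
    """Compute y = sum over i in S of (p_i / sum_{j in S} p_j) * E_i(x)."""
    norm = sum(probs[i] for i in selected)
    y = [0.0] * len(x)
    for i in selected:
        weight = probs[i] / norm   # renormalized routing weight
        out = experts[i](x)        # only selected experts are evaluated
        y = [acc + weight * o for acc, o in zip(y, out)]
    return y

# Toy experts (assumed): expert i multiplies its input by (i + 1).
experts = [lambda v, s=i + 1: [s * t for t in v] for i in range(4)]
probs = [0.1, 0.4, 0.2, 0.3]
selected = [1, 3]                  # top-2 experts for this token
x = [1.0, -2.0]

y = moe_combine(x, probs, selected, experts)
print(y)  # approximately (20/7) * x, since (4/7)*2 + (3/7)*4 = 20/7
```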
Load Balancing — The Core Training Challenge
Why Balance Matters
Expert Collapse
Without regularization the router tends to converge on the same 1–2 experts. In an 8-expert layer, for example, the other 6 experts then receive ~0 tokens and their parameters are wasted.
Auxiliary Loss (Switch)
L_aux = α·N · Σᵢ fᵢ·Pᵢ
fᵢ = fraction of tokens routed to expert i. Pᵢ = mean router probability assigned to expert i. The loss is smallest under uniform routing and is differentiable through Pᵢ.
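A small numerical sketch shows how this loss separates balanced from collapsed routing. The function name `switch_aux_loss`, the value alpha = 0.01, and the two toy batches are assumptions for illustration; no gradients are computed here.

```python
def switch_aux_loss(router_probs, assignments, num_experts, alpha=0.01):
    """L_aux = alpha * N * sum_i f_i * P_i for one batch of tokens.

    router_probs: one list of N probabilities per token.
    assignments:  the top-1 expert index chosen for each token.
    """
    num_tokens = len(router_probs)
    # f[i]: fraction of tokens dispatched to expert i (a hard count).
    f = [0.0] * num_experts
    for expert in assignments:
        f[expert] += 1.0 / num_tokens
    # P[i]: mean router probability for expert i (the differentiable part).
    P = [sum(p[i] for p in router_probs) / num_tokens for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, P))

# Perfectly balanced routing over 4 experts.
balanced_probs = [[0.25] * 4 for _ in range(4)]
balanced = switch_aux_loss(balanced_probs, [0, 1, 2, 3], num_experts=4)

# Fully collapsed routing: every token goes to expert 0 with probability 1.
collapsed_probs = [[1.0, 0.0, 0.0, 0.0] for _ in range(4)]
collapsed = switch_aux_loss(collapsed_probs, [0, 0, 0, 0], num_experts=4)

print(balanced, collapsed)  # about alpha versus about alpha * N
```

Under uniform routing the loss equals α, and under full collapse it rises to α·N, so the penalty grows with the degree of imbalance.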
Expert Capacity
Buffer size per expert: capacity = ⌊tokens/N⌋ × C, where C is the capacity factor. Tokens beyond an expert's capacity are dropped and pass through on the residual connection only. C = 1.0–1.25 is typical.
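The capacity rule and the resulting drops can be sketched as follows. Truncating the product to an integer and filling buffers in token order (first come, first served) are assumptions of this sketch; the helper names are hypothetical.

```python
def expert_capacity(num_tokens, num_experts, capacity_factor):
    """capacity = floor(tokens / N) * C, truncated to an integer (assumed)."""
    return int((num_tokens // num_experts) * capacity_factor)

def route_with_capacity(assignments, num_experts, capacity):
    """Fill expert buffers in token order; tokens that overflow are dropped."""
    load = [0] * num_experts
    kept, dropped = [], []
    for token_idx, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token_idx)
        else:
            dropped.append(token_idx)  # would follow the residual path only
    return kept, dropped

# 16 tokens, 4 experts, C = 1.25; expert 0 is oversubscribed with 7 tokens.
assignments = [0] * 7 + [1] * 3 + [2] * 3 + [3] * 3
capacity = expert_capacity(len(assignments), num_experts=4, capacity_factor=1.25)
kept, dropped = route_with_capacity(assignments, num_experts=4, capacity=capacity)
print(capacity, dropped)  # 5 [5, 6]
```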
Real-World Variants
Mixtral: top-2, no dropping. Switch: top-1, capacity factor 1.25. DeepSeek-V3: 256 routed experts plus a shared expert, top-8 fine-grained routing.
Hardware Implications — Why MoE is Hard to Deploy
All-to-All Communication Pattern
Expert Parallelism
Experts are spread across D GPUs, each holding N/D experts. Tokens must be dispatched to the device that owns their expert, which requires All-to-All collectives in every MoE layer: one to dispatch tokens and one to combine the expert outputs.
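The dispatch pattern can be illustrated without any distributed runtime by counting how many tokens each device would send to each other device. This sketch assumes N is divisible by D and that experts are placed in contiguous blocks; both helper names are hypothetical.

```python
def device_of_expert(expert_id, num_experts, num_devices):
    """Map an expert to its device, assuming contiguous blocks of N/D experts."""
    experts_per_device = num_experts // num_devices
    return expert_id // experts_per_device

def dispatch_counts(src_devices, expert_ids, num_experts, num_devices):
    """send[s][d] = number of tokens device s must send to device d."""
    send = [[0] * num_devices for _ in range(num_devices)]
    for src, expert in zip(src_devices, expert_ids):
        dst = device_of_expert(expert, num_experts, num_devices)
        send[src][dst] += 1
    return send

# 8 experts on 4 devices (2 experts each); six tokens with top-1 routing.
src_devices = [0, 0, 1, 2, 3, 3]   # device where each token currently lives
expert_ids = [0, 5, 2, 7, 1, 6]    # expert chosen for each token
send = dispatch_counts(src_devices, expert_ids, num_experts=8, num_devices=4)
for row in send:
    print(row)
```

Each row of `send` is what one device contributes to the All-to-All; off-diagonal entries are tokens that actually cross the interconnect.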
Communication Bottleneck
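All-to-All moves O(B × d_model) bytes per device pair, where B is the number of tokens exchanged between the pair. Without explicit overlap it sits on the critical path: experts cannot start until their tokens arrive, so every MoE layer adds a synchronization point.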
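A back-of-the-envelope estimate gives a feel for the traffic involved. The sketch assumes routing is uniform across devices, that each token copy carries d_model activations in bf16 (2 bytes), and that the combine step mirrors the dispatch step; the function name and the example sizes are hypothetical.

```python
def all_to_all_bytes(tokens_per_device, d_model, top_k, num_devices, bytes_per_elem=2):
    """Rough per-layer traffic estimate under uniform routing (an assumption)."""
    # Every token is copied to top_k experts, each copy is d_model activations.
    payload = tokens_per_device * top_k * d_model * bytes_per_elem
    per_pair = payload / num_devices              # share sent to each device
    dispatch = per_pair * (num_devices - 1)       # the local share stays on-device
    return {
        "per_pair": per_pair,
        "dispatch": dispatch,
        "dispatch_plus_combine": 2 * dispatch,    # outputs travel back the same way
    }

est = all_to_all_bytes(tokens_per_device=4096, d_model=4096, top_k=2, num_devices=8)
for name, value in est.items():
    print(f"{name}: {value / 2**20:.0f} MiB")
```

With these example sizes each device sends 56 MiB per dispatch, or 112 MiB per MoE layer once the combine step is included.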
Memory vs Compute
Dense 70B: ~140 GB of weights in bf16, high FLOPs. Mixtral 8×7B (46.7B params): ~93 GB, low FLOPs. You pay in memory, not arithmetic, and decoding is often memory-bandwidth bound on A100/H100.
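The memory figures follow directly from the parameter counts. This sketch counts weights only, at 2 bytes per parameter (bf16), and ignores the KV cache and activations; the function name is hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Weight memory in GB (10^9 bytes), assuming bf16 storage by default."""
    return num_params * bytes_per_param / 1e9

dense_gb = weight_memory_gb(70e9)    # dense 70B model
moe_gb = weight_memory_gb(46.7e9)    # Mixtral 8x7B: all experts must be resident

print(f"dense 70B:    {dense_gb:.1f} GB")
print(f"Mixtral 8x7B: {moe_gb:.1f} GB")
```

Every expert has to be resident in memory even though only 12.9B parameters are read for any one token.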
Token Dropping
When an expert buffer overflows, tokens skip expert computation entirely (residual stream passthrough). Introduces approximation but maintains throughput SLA.
Dense vs Sparse MoE — Tradeoffs
| Dimension | Dense Transformer | Sparse MoE | Verdict |
|---|---|---|---|
| Parameter count | 70B | 8×13B = 104B total, 13B active | MoE — more capacity |
| FLOPs per token (forward) | ~140 GFLOPs (70B model) | ~26 GFLOPs (13B active) | MoE (~5× cheaper) |
| Memory required | 140 GB (bf16) | 208 GB (all experts) | Dense — fits fewer GPUs |
| Inference throughput | ~1× baseline | ~4–6× baseline | MoE — less compute/token |
| Training stability | High | Moderate (collapse risk) | Dense |
| Communication cost | Low | High (All-to-All per layer) | Dense |
| Quality at same FLOPs | Baseline | +20–40% better loss | MoE |
| Serving complexity | Simple | Complex (capacity, routing) | Dense |
| Expert specialization | N/A | Emergent, though often syntactic rather than topical | MoE (interpretability) |
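
The FLOPs row can be reproduced with a common rule of thumb: roughly 2 FLOPs per active parameter per token in the forward pass. That rule is an approximation that ignores attention over the context, and the function name below is hypothetical.

```python
def forward_flops_per_token(active_params):
    """Approximate forward FLOPs per token as 2 * active parameters (rule of thumb)."""
    return 2 * active_params

dense_flops = forward_flops_per_token(70e9)   # dense 70B: every parameter is active
moe_flops = forward_flops_per_token(13e9)     # sparse MoE with 13B active parameters
ratio = dense_flops / moe_flops

print(f"dense: {dense_flops / 1e9:.0f} GFLOPs per token")
print(f"MoE:   {moe_flops / 1e9:.0f} GFLOPs per token")
print(f"ratio: {ratio:.1f}x")
```

The ratio depends only on active parameters, which is why the MoE model can hold more total parameters without a matching rise in per-token compute.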
Notable MoE Models
Sparsely-Gated MoE
Shazeer et al. (2017). First large-scale MoE in language modeling: up to 137B params with noisy top-k gating. Showed that sparse routing can match dense quality at a fraction of the FLOPs.
Switch Transformer
Google — Top-1 routing, 1.6T params. First to stabilize MoE at trillion-parameter scale with the auxiliary balancing loss.
Mixtral 8×7B
Mistral AI. Open-weight, 46.7B total / 12.9B active. Matches or beats Llama-2 70B on most benchmarks with roughly 5× fewer active parameters (Mistral reports about 6× faster inference).
DeepSeek-V3 / MoE
256 routed experts plus a shared expert, top-8 fine-grained routing. Auxiliary-loss-free load balancing. Among the strongest open-weight models as of late 2024.
Quality vs Compute Budget (Scaling Law Explorer)