EE508 · Systems for ML · Interactive Visualizer

Mixture of Experts

Explore sparse activation, token routing, load balancing, and the hardware implications of MoE, the architecture used in Mixtral, DeepSeek-V3, and Gemini 1.5, and reportedly in GPT-4.

Core Idea

Dense FFN

Traditional Transformer FFN

Every token passes through one large FFN. All parameters active on all tokens. Compute scales linearly with model size.

Sparse MoE

Mixture of Experts FFN

The FFN is split into N expert sub-networks. Each token activates only its top-k experts, so total parameters grow with N while FLOPs/token depend only on k.

Router

Gating Network

A learned linear router maps token embeddings → expert scores. Softmax + top-k selection decides which experts process each token.

Key Insight

Parameter Efficiency

Mixtral 8×7B: 46.7B total params, only 12.9B active per token, yet it beats Llama-2 70B on most benchmarks with roughly 5× fewer active parameters (Mistral AI reports about 6× faster inference).
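The ratios implied by these figures can be checked with quick arithmetic. This is a rough sketch, not a benchmark: it assumes Llama-2 70B uses all 70B parameters per token and that per-token compute is proportional to active parameters (attention cost over the context is ignored).

```python
# Figures quoted above, in billions of parameters.
mixtral_total_b = 46.7
mixtral_active_b = 12.9
dense_active_b = 70.0  # assumption: a dense 70B model touches every parameter per token

# Fraction of Mixtral's weights used by any single token.
active_fraction = mixtral_active_b / mixtral_total_b

# Assumption: per-token compute scales with active parameters.
active_param_ratio = dense_active_b / mixtral_active_b

print(f"active fraction: {active_fraction:.1%}")
print(f"dense / MoE active-parameter ratio: {active_param_ratio:.1f}x")
```

Under these assumptions, about 28% of the weights are active per token and the dense model uses about 5.4× more active parameters.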

Architecture — Click "Animate Token" to watch a forward pass

Experts (N) 8
Top-k 2

Gating Mechanism — Step by Step

Experts (N) 6
Top-k 2
Temperature τ 1.0

Router Math

Step 1 — Linear Projection

g(x) = W_g · x

The token embedding x is projected to N logits, one per expert. W_g is small: N × d_model.

Step 2 — Softmax

p = Softmax(g(x) / τ)

Temperature τ controls sharpness: low τ → near one-hot, high τ → near uniform. Training typically uses τ = 1.
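A minimal sketch of the temperature-scaled softmax, using only the standard library. The logit values are hypothetical and chosen just to make the effect of τ visible.

```python
import math

def softmax_with_temperature(logits, tau=1.0):
    """Return softmax(logits / tau) as a list of probabilities."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical router logits for N = 6 experts.
logits = [2.0, 1.0, 0.5, 0.0, -0.5, -1.0]

for tau in (0.1, 1.0, 10.0):
    probs = softmax_with_temperature(logits, tau)
    print(f"tau={tau}: {[round(p, 3) for p in probs]}")
```

With τ = 0.1 nearly all probability mass lands on the first expert; with τ = 10 the six probabilities differ by only a few percentage points.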

Step 3 — Top-k Select

S = TopK(p, k)

Pick the k highest-probability experts. Only these run a forward pass; all others are skipped entirely.
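Top-k selection can be sketched as a sort over expert indices. The probabilities below are hypothetical; real implementations usually use a batched top-k kernel rather than a Python sort.

```python
def top_k_indices(probs, k):
    """Return the indices of the k largest probabilities, highest first."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]

# Hypothetical router probabilities for N = 6 experts.
probs = [0.05, 0.40, 0.10, 0.30, 0.05, 0.10]

selected = top_k_indices(probs, k=2)
skipped = [i for i in range(len(probs)) if i not in selected]

print("selected experts:", selected)
print("skipped experts:", skipped)
```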

Step 4 — Weighted Sum

y = Σᵢ∈S p̃ᵢ · Eᵢ(x)

Selected expert outputs combined by renormalized routing weights p̃ᵢ = pᵢ / Σⱼ∈S pⱼ.
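The four steps can be put together in a toy forward pass. This is a sketch under stated assumptions: the router matrix `W_g`, the input, and the "experts" (simple scaling functions built by the hypothetical helper `make_scaling_expert`) are illustrative stand-ins for learned networks.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(z, tau=1.0):
    m = max(v / tau for v in z)
    e = [math.exp(v / tau - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def moe_forward(x, W_g, experts, k, tau=1.0):
    logits = matvec(W_g, x)                          # Step 1: g(x) = W_g · x
    p = softmax(logits, tau)                         # Step 2: p = Softmax(g(x) / τ)
    S = sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]  # Step 3
    norm = sum(p[i] for i in S)
    weights = {i: p[i] / norm for i in S}            # renormalized p̃ᵢ
    y = [0.0] * len(x)
    for i in S:                                      # Step 4: weighted sum
        out = experts[i](x)                          # only selected experts run
        y = [yi + weights[i] * oi for yi, oi in zip(y, out)]
    return y, weights

def make_scaling_expert(scale):
    """Hypothetical stand-in for an expert FFN: multiplies its input by scale."""
    return lambda x: [scale * v for v in x]

# Toy setup: d_model = 2, N = 3 experts, top-2 routing.
W_g = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
experts = [make_scaling_expert(s) for s in (1.0, 2.0, 3.0)]

y, weights = moe_forward([1.0, 2.0], W_g, experts, k=2)
print("routing weights:", weights)
print("output:", y)
```

Here the logits are [1, 2, -3], so experts 1 and 0 are selected and the third expert is never evaluated.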

Load Balancing — The Core Training Challenge

Experts (N) 8
Tokens/batch 64
Collapse bias 5

Why Balance Matters

Expert Collapse

Without regularization, the router can converge to always picking the same 1–2 experts. In that case 6 of 8 experts receive ~0 tokens and their parameters are wasted.

Auxiliary Loss (Switch)

L_aux = α·N · Σᵢ fᵢ·Pᵢ

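f_i = fraction of tokens routed to expert i. P_i = mean router probability assigned to expert i over the batch. The loss is smallest under uniform routing and penalizes imbalance differentiably (gradients flow through P_i).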
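A small sketch of this loss for top-1 routing, comparing a balanced batch with a collapsed one. The batches are hypothetical, and α = 0.01 is used here as an assumed default.

```python
def switch_aux_loss(router_probs, assignments, num_experts, alpha=0.01):
    """Compute L_aux = alpha * N * sum_i f_i * P_i.

    router_probs: one probability list (length N) per token.
    assignments:  the expert index each token was routed to (top-1).
    """
    num_tokens = len(assignments)
    counts = [0] * num_experts
    for expert in assignments:
        counts[expert] += 1
    f = [c / num_tokens for c in counts]  # fraction of tokens per expert
    P = [sum(p[i] for p in router_probs) / num_tokens for i in range(num_experts)]
    return alpha * num_experts * sum(fi * Pi for fi, Pi in zip(f, P))

# Balanced: 4 tokens, 4 experts, uniform probabilities, one token per expert.
balanced = switch_aux_loss([[0.25] * 4] * 4, [0, 1, 2, 3], num_experts=4)

# Collapsed: every token strongly prefers, and is routed to, expert 0.
collapsed = switch_aux_loss([[0.97, 0.01, 0.01, 0.01]] * 4, [0, 0, 0, 0], num_experts=4)

print(f"balanced:  {balanced:.4f}")
print(f"collapsed: {collapsed:.4f}")
```

The balanced batch gives a loss of α (the minimum), while the collapsed batch gives nearly 4α.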

Expert Capacity

Buffer size per expert: capacity = ⌊tokens/N⌋ × C, where C is the capacity factor. Tokens beyond capacity are dropped and bypass the expert through the residual connection. C = 1.0–1.25 is typical.
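The capacity formula and its effect on a skewed batch can be sketched as follows. The routing assignments are hypothetical, and tokens are assumed to be admitted in arrival order, which is one common policy rather than the only one.

```python
def expert_capacity(tokens, num_experts, capacity_factor):
    """capacity = floor(tokens / N) * C, truncated to an integer."""
    return int((tokens // num_experts) * capacity_factor)

def route_with_capacity(assignments, num_experts, capacity):
    """Fill each expert's buffer in arrival order and count overflow tokens."""
    load = [0] * num_experts
    dropped = 0
    for expert in assignments:
        if load[expert] < capacity:
            load[expert] += 1
        else:
            dropped += 1  # overflow: token skips the expert (residual bypass)
    return load, dropped

tokens, num_experts = 64, 8
capacity = expert_capacity(tokens, num_experts, capacity_factor=1.25)

# Hypothetical skewed routing: 30 tokens to expert 0, 20 to expert 1,
# and the remaining 14 spread over experts 2..7.
assignments = [0] * 30 + [1] * 20 + [2 + (t % 6) for t in range(14)]

load, dropped = route_with_capacity(assignments, num_experts, capacity)
print("capacity per expert:", capacity)
print("load per expert:", load)
print("dropped tokens:", dropped)
```

With 64 tokens, 8 experts, and C = 1.25, each expert holds 10 tokens, so the two overloaded experts drop 30 tokens between them.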

Real-World Variants

Mixtral: top-2, no dropping. Switch: top-1, capacity factor 1.25. DeepSeek-V3: 256 routed experts, top-8 fine-grained routing.

Hardware Implications — Why MoE is Hard to Deploy

Total Params
Active Params/Token
Sparsity Ratio
Memory (bf16)
FLOPs/Token
Experts (N) 8
Top-k 2
Expert FFN dim 4096
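The quantities listed in this panel can be estimated for a single MoE layer with a few lines of arithmetic. This is a sketch, not the visualizer's own formula: it assumes d_model = 1024 (hypothetical), a two-matrix FFN without biases, a sparsity ratio defined as active/total parameters, and the rule of thumb of 2 FLOPs per active parameter per token. Attention and embedding parameters are excluded.

```python
def moe_layer_stats(num_experts, top_k, d_ff, d_model, bytes_per_param=2):
    """Estimate parameter, memory, and FLOP counts for one MoE FFN layer."""
    expert_params = 2 * d_model * d_ff       # up- and down-projection, no biases
    router_params = d_model * num_experts    # W_g
    total = num_experts * expert_params + router_params
    active = top_k * expert_params + router_params
    return {
        "total_params": total,
        "active_params_per_token": active,
        "sparsity_ratio": active / total,            # assumed definition
        "memory_bytes": total * bytes_per_param,     # bf16 = 2 bytes per parameter
        "flops_per_token": 2 * active,               # multiply + add per parameter
    }

stats = moe_layer_stats(num_experts=8, top_k=2, d_ff=4096, d_model=1024)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Raising N increases total parameters and memory but leaves FLOPs/token almost unchanged; raising top-k increases FLOPs/token but not memory.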

All-to-All Communication Pattern

Expert Parallelism

Experts are spread across D GPUs, each holding N/D experts. Tokens must be dispatched to the device that hosts their expert and the results returned, which requires All-to-All collectives in every MoE layer.
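A sketch of the placement and dispatch bookkeeping, assuming experts are assigned to devices in contiguous blocks and that N is divisible by D. The function names and the routing decisions are hypothetical.

```python
def expert_to_device(expert_id, num_experts, num_devices):
    """Map an expert to its host device under contiguous block placement."""
    if num_experts % num_devices != 0:
        raise ValueError("this sketch assumes N is divisible by D")
    experts_per_device = num_experts // num_devices
    return expert_id // experts_per_device

def dispatch_counts(token_experts, num_experts, num_devices):
    """Count how many (token, expert) pairs each device must receive."""
    counts = [0] * num_devices
    for experts in token_experts:
        for e in experts:
            counts[expert_to_device(e, num_experts, num_devices)] += 1
    return counts

# Hypothetical top-2 routing decisions for 4 tokens, N = 8 experts, D = 4 devices.
token_experts = [(0, 5), (1, 2), (6, 7), (3, 4)]
counts = dispatch_counts(token_experts, num_experts=8, num_devices=4)

print("expert 5 lives on device", expert_to_device(5, 8, 4))
print("token copies received per device:", counts)
```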

Communication Bottleneck

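All-to-All moves up to O(B × d_model) bytes between each device pair, where B is the number of tokens per device. It is hard to overlap with compute, because the experts cannot start until dispatch completes, so it acts as a synchronization barrier in every MoE layer.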
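A back-of-the-envelope estimate of the traffic one device sends in one MoE layer. The sketch assumes uniform routing (so a fraction (D−1)/D of token copies go to other devices), bf16 activations, and one dispatch plus one combine exchange of equal size; the batch and model sizes are hypothetical.

```python
def all_to_all_bytes_per_device(tokens_per_device, d_model, top_k,
                                num_devices, bytes_per_elem=2):
    """Estimate bytes one device sends per MoE layer (dispatch + combine)."""
    # Each token is copied once per selected expert.
    one_way = tokens_per_device * top_k * d_model * bytes_per_elem
    # Assumption: uniform routing, so (D - 1) / D of the copies leave the device.
    remote = one_way * (num_devices - 1) // num_devices
    return 2 * remote  # dispatch to experts, then combine results back

sent = all_to_all_bytes_per_device(
    tokens_per_device=4096, d_model=4096, top_k=2, num_devices=8
)
print(f"{sent} bytes (~{sent / 2**20:.0f} MiB) per device per MoE layer")
```

Multiplying by the number of MoE layers gives the per-step communication volume that the interconnect has to sustain.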

Memory vs Compute

Dense 70B: 140 GB of weights, high FLOPs. MoE 8×7B (46.7B total): about 93 GB of weights, low FLOPs. You pay in memory, not arithmetic, and decoding is often memory-bandwidth bound on A100/H100.
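The memory figures follow from parameter count × bytes per parameter. The sketch below covers weights only (no KV cache or activations) and assumes bf16 storage at 2 bytes per parameter and 1 GB = 10^9 bytes.

```python
def weight_memory_gb(params_billion, bytes_per_param=2):
    """Weight memory in GB: (params_billion * 1e9 * bytes) / 1e9."""
    return params_billion * bytes_per_param

dense_gb = weight_memory_gb(70.0)        # dense 70B model
moe_total_gb = weight_memory_gb(46.7)    # all experts must stay resident
moe_active_gb = weight_memory_gb(12.9)   # weights actually read for one token

print(f"dense 70B weights:       {dense_gb:.1f} GB")
print(f"MoE total weights:       {moe_total_gb:.1f} GB")
print(f"MoE weights per token:   {moe_active_gb:.1f} GB")
```

The gap between resident weights and weights read per token is why the MoE is cheap in arithmetic but still expensive in memory capacity.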

Token Dropping

When an expert buffer overflows, the excess tokens skip expert computation entirely and pass through on the residual stream. This introduces an approximation but keeps per-expert work bounded, which helps maintain throughput targets.

Dense vs Sparse MoE — Tradeoffs

Dimension | Dense Transformer | Sparse MoE | Verdict
Parameter count | 70B | 8×13B = 104B total, 13B active | MoE (more capacity)
FLOPs per token | ~140 GFLOPs (2 × 70B) | ~26 GFLOPs (2 × 13B active, top-2 of 8) | MoE (~5× cheaper)
Memory required | 140 GB (bf16) | 208 GB (all experts, bf16) | Dense (fits on fewer GPUs)
Inference throughput | ~1× baseline | ~4–6× baseline | MoE (less compute/token)
Training stability | High | Moderate (collapse risk) | Dense
Communication cost | Low | High (All-to-All per layer) | Dense
Quality at same FLOPs | Baseline | +20–40% better loss | MoE
Serving complexity | Simple | Complex (capacity, routing) | Dense
Expert specialization | N/A | Emergent domain experts | MoE (interpretability)

Notable MoE Models

2017

Sparsely-Gated MoE

Shazeer et al.: first large-scale MoE in a language model, with up to 137B params and noisy top-k gating. Showed that sparse routing can match dense quality at a fraction of the FLOPs.

2021

Switch Transformer

Google: top-1 routing, up to 1.6T params. Scaled MoE to the trillion-parameter range using a simplified auxiliary balancing loss.

2023

Mixtral 8×7B

Mistral AI: open-weight, 46.7B total / 12.9B active. Reported to beat Llama-2 70B on most benchmarks with about 6× faster inference.

2024

DeepSeek-V3 / MoE

256 routed experts, top-8 fine-grained routing. Auxiliary-loss-free load balancing. Among the strongest open-weight models as of late 2024.

Quality vs Compute Budget (Scaling Law Explorer)

[Interactive chart: quality vs. compute budget (TFLOPs/token; slider default 20), with curves for a dense model and a sparse MoE and a marker for the selected budget.]