What is Sparse Activation?
Traditional Transformer
Every token passes through one large FFN. All parameters active on all tokens. Compute ∝ parameters.
Mixture of Experts
FFN replaced by N expert sub-networks. Each token activates only its top-k experts. Total parameters ↑, FLOPs/token stays roughly constant.
Gating Network
Small linear layer: token embedding → N logits → softmax → top-k selection. Learned end-to-end with the model.
Mixtral 8×7B
46.7B total params, 12.9B active per token. Matches or beats Llama-2 70B on most benchmarks with roughly 5× fewer active parameters per token.
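The gating step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any model's actual router: `gate_weights`, `route`, and the toy dimensions are hypothetical names and values, and renormalizing the selected weights is one common choice among several.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(token, gate_weights, k=2):
    """Return [(expert_index, weight), ...] for the top-k experts.

    gate_weights is a hypothetical N x d matrix, one row per expert.
    """
    # Linear layer: one logit per expert.
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    probs = softmax(logits)
    # Keep the k most probable experts, highest first.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Assumption: renormalize so the k selected weights sum to 1.
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

random.seed(0)
d_model, n_experts = 4, 8  # toy sizes, chosen for illustration only
gate_weights = [[random.gauss(0, 1) for _ in range(d_model)]
                for _ in range(n_experts)]
token = [0.5, -1.0, 0.25, 2.0]
selected = route(token, gate_weights, k=2)
```

The layer output would then be the weighted sum of the selected experts' outputs; the other N − k experts are never evaluated for this token.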
Animate a Token Through the MoE Layer
Where Does MoE Sit in the Transformer?
MoE replaces the FFN sublayer inside each transformer block. Every other component (token embeddings, multi-head attention, layer norm, residual connections) stays identical to a dense transformer.
Layer-by-Layer Breakdown
Multi-Head Attention
Identical in dense and MoE. Each token attends to all others in the sequence. Cost = O(seq²·d). Not changed by MoE.
MoE FFN Sublayer
Replaces the single FFN. Router selects top-k of N experts. Only k expert FFNs execute — rest are skipped. Residual + LayerNorm wrap it exactly as before.
Residual Stream
The residual connection bypasses each sublayer. In MoE, dropped tokens (over capacity) ride the residual unchanged — their representation is not updated by that expert layer.
Mixed Architecture
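Many real models alternate: some layers use a dense FFN (for stability at early/late layers), others use MoE. GShard, for example, places MoE in every other FFN layer. Mixtral uses MoE at every FFN layer. GPT-4 is rumored to use MoE, but its architecture has not been published.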
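Because only the FFN is replicated while attention and embeddings are shared, the totals for Mixtral 8×7B can be reproduced with simple arithmetic. The sketch below uses the hyperparameters published for Mixtral 8×7B; it assumes untied input/output embeddings and ignores the small normalization weights, so treat the result as an approximation.

```python
# Published Mixtral 8x7B hyperparameters.
d_model = 4096
d_ffn = 14336
n_layers = 32
n_heads = 32
n_kv_heads = 8      # grouped-query attention
head_dim = 128
vocab = 32000
n_experts = 8
top_k = 2

# Shared parts, identical to a dense transformer.
attn = 2 * d_model * n_heads * head_dim        # Wq and Wo
attn += 2 * d_model * n_kv_heads * head_dim    # Wk and Wv
embed = 2 * vocab * d_model                    # assumption: untied in/out embeddings

# MoE-specific parts.
router = d_model * n_experts                   # one linear gate per layer
expert = 3 * d_model * d_ffn                   # gated FFN: gate, up, down projections

total = n_layers * (attn + router + n_experts * expert) + embed
active = n_layers * (attn + router + top_k * expert) + embed

total_b = round(total / 1e9, 1)    # billions of parameters stored
active_b = round(active / 1e9, 1)  # billions of parameters used per token
```

With these numbers `total_b` comes out at 46.7 and `active_b` at 12.9, which is why the model is far smaller than a literal 8 × 7B = 56B.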
Load Balancing — The Core Training Challenge
Why Balance Matters
Expert Collapse
Without regularization the router picks 1–2 favorites. Others receive ~0 tokens — wasted parameters, no gradient, stuck training.
Auxiliary Loss
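L_aux = α·N·Σᵢ fᵢ·Pᵢ
fᵢ = fraction of tokens routed to expert i (pre-capacity). Pᵢ = mean router probability assigned to expert i over the batch. fᵢ is a hard count, so the gradient flows through Pᵢ; this is what lets the loss penalize imbalance differentiably. The sum is smallest when routing is uniform.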
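A minimal sketch of the auxiliary loss for top-1 routing, written in plain Python so the two terms are easy to inspect. The function name, the toy batches, and the value α = 0.01 are assumptions made for illustration; a real implementation would compute this on tensors inside the training graph.

```python
def aux_loss(router_probs, assignments, n_experts, alpha=0.01):
    """Load-balancing loss alpha * N * sum_i f_i * P_i.

    router_probs: T lists of N softmax probabilities (one list per token).
    assignments:  T expert indices chosen by top-1 routing.
    """
    T = len(assignments)
    # f_i: fraction of tokens dispatched to expert i (hard counts).
    f = [0.0] * n_experts
    for e in assignments:
        f[e] += 1.0 / T
    # P_i: mean router probability mass given to expert i.
    P = [sum(p[i] for p in router_probs) / T for i in range(n_experts)]
    return alpha * n_experts * sum(fi * Pi for fi, Pi in zip(f, P))

# Perfectly balanced routing over 4 experts.
balanced = aux_loss([[0.25] * 4 for _ in range(4)], [0, 1, 2, 3], n_experts=4)

# Collapsed routing: every token goes to expert 0 with full confidence.
collapsed = aux_loss([[1.0, 0.0, 0.0, 0.0] for _ in range(4)], [0, 0, 0, 0], n_experts=4)
```

In this toy case the collapsed batch scores four times higher than the balanced one (N times higher in general), which is the pressure that pushes the router back toward uniform use.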
Capacity & Dropping
capacity = ⌊T/N⌋×C, where T = tokens in the batch, N = number of experts, C = capacity factor. Tokens over capacity are dropped and take the residual bypass. Red hatching shows overflow in the chart above.
Real-World
Mixtral: top-2, no dropping. Switch: top-1, C=1.25. DeepSeek-V3: aux-loss-free balancing via a per-expert bias on routing scores, 256 routed experts plus 1 shared expert, top-8.
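The capacity rule can be illustrated with a small dispatch routine. This is a sketch under stated assumptions: `dispatch` is a hypothetical helper, tokens are served first-come-first-served, and the capacity is truncated to an integer.

```python
def dispatch(assignments, n_experts, capacity_factor=1.25):
    """Fill per-expert buffers up to capacity; return overflow separately.

    assignments: list of expert indices, one per token (top-1 routing).
    """
    # capacity = floor(T / N) * C, truncated to a whole number of slots.
    capacity = int((len(assignments) // n_experts) * capacity_factor)
    buffers = [[] for _ in range(n_experts)]
    dropped = []
    for token_idx, expert in enumerate(assignments):
        if len(buffers[expert]) < capacity:
            buffers[expert].append(token_idx)
        else:
            # Over capacity: the token skips the expert and keeps
            # only its residual value for this layer.
            dropped.append(token_idx)
    return capacity, buffers, dropped

# 16 tokens, 4 experts, deliberately skewed toward expert 0.
assignments = [0] * 8 + [1] * 4 + [2] * 3 + [3] * 1
capacity, buffers, dropped = dispatch(assignments, n_experts=4)
loads = [len(b) for b in buffers]
```

Here each expert has 5 slots, so expert 0 keeps five of its eight tokens and the remaining three are dropped. Raising C reduces drops at the cost of larger, partly empty buffers.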
Hardware Implications — Why MoE is Hard to Deploy
All-to-All Communication & Memory Tradeoff
Expert Parallelism
With N experts spread over D GPUs, each GPU holds N/D experts. Tokens are dispatched to the correct device via an All-to-All collective at every MoE layer, which acts as a hard sync barrier.
Comm. Bottleneck
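O(B×d_model) bytes per device per MoE layer, where B is the number of tokens on that device. Hard to overlap with compute, because experts cannot start until their tokens arrive. Larger N means more shards and more All-to-All traffic.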
Memory vs Compute
MoE 8×7B: ~94GB of 16-bit weights, ~26 GFLOPs/tok. Dense 70B: ~140GB, ~140 GFLOPs/tok. Pay in memory, save on compute.
Token Dropping
Buffer overflow → token bypasses expert via residual. Introduces approximation error. Careful capacity tuning is critical for quality.
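The memory and compute figures above follow from two rules of thumb, sketched here. Both are assumptions rather than measurements: weights are stored in 16 bits (2 bytes per parameter), and a forward pass costs about 2 FLOPs per active parameter per token, which ignores attention-score and KV-cache costs.

```python
BYTES_PER_PARAM = 2  # assumption: fp16/bf16 weights

def weight_memory_gb(total_params):
    """Memory needed just to hold the weights, in GB (1e9 bytes)."""
    return total_params * BYTES_PER_PARAM / 1e9

def forward_gflops_per_token(active_params):
    """Approximate forward cost: one multiply and one add per active parameter."""
    return 2 * active_params / 1e9

# Mixtral 8x7B: all 46.7B params must be resident, only 12.9B are used per token.
moe_mem = weight_memory_gb(46.7e9)
moe_flops = forward_gflops_per_token(12.9e9)

# Dense 70B: every parameter is both stored and used.
dense_mem = weight_memory_gb(70e9)
dense_flops = forward_gflops_per_token(70e9)

compute_ratio = dense_flops / moe_flops
```

Under these assumptions the MoE model needs about 93 GB and 26 billion FLOPs per token against 140 GB and 140 billion for the dense model, a compute ratio of roughly 5.4.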
What is Expert Specialization?
Emergent Behavior
Nobody programs experts to specialize. It emerges from training — experts that handle certain token types well receive more gradient for those tokens and naturally develop domain focus.
Mixtral Findings
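Routing analysis of Mixtral 8×7B (Jiang et al., 2024) found no obvious topic-level clustering: expert assignment looks similar across ArXiv, PubMed, and PhilPapers text, with only DM Mathematics marginally different. Routing instead tracks syntax and position: tokens such as Python's "self" or code indentation tend to go to the same experts, and consecutive tokens often share an expert.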
Why It Matters
Specialization is one commonly cited reason MoE can beat dense models at equal FLOPs: each expert only has to model part of the input distribution. It also helps interpretability, since you can probe which experts respond to a given kind of token.
Not Perfect
Soft and distributed, not hard-coded. Load balancing constrains it. Early layers specialize less. The same token type may split across multiple experts.
Routing Heatmap — Token Category × Expert Affinity
Brighter = stronger affinity. Values are approximate and illustrative, loosely based on Mixtral 8×7B layer analysis.
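An affinity table of this kind can be computed from logged routing decisions by counting and row-normalizing. The sketch below is one simple way to do it; `affinity` and the sample `records` are hypothetical and do not come from any real model.

```python
from collections import Counter, defaultdict

def affinity(records, n_experts):
    """Row-normalized routing counts: category -> [share per expert].

    records: iterable of (token_category, expert_index) pairs.
    """
    counts = defaultdict(Counter)
    for category, expert in records:
        counts[category][expert] += 1
    table = {}
    for category, per_expert in counts.items():
        total = sum(per_expert.values())
        # Each row sums to 1, so rows are comparable across categories.
        table[category] = [per_expert[e] / total for e in range(n_experts)]
    return table

# Made-up routing log for illustration only.
records = [
    ("code", 0), ("code", 0), ("code", 1), ("code", 0),
    ("math", 2), ("math", 2), ("math", 3), ("math", 2),
]
heat = affinity(records, n_experts=4)
```

Each row of `heat` is what one row of the heatmap encodes: the share of a category's tokens sent to each expert.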
Training Dynamics — Watch Specialization Emerge
Live Token Stream
What is Mixture of Depths (MoD)?
MoD (Raposo et al., Google DeepMind, 2024) extends sparse activation from the width dimension to the depth dimension. Instead of asking "which expert processes this token?", MoD asks: "does this token even need to go through this layer?"
Routes Across Experts
All tokens process every layer. Within each FFN, only k of N experts are activated. Saves compute inside the FFN sublayer.
Routes Across Layers
At each layer, a router decides which tokens to process. Tokens that don't "need" this layer skip it entirely via the residual — zero compute for those tokens at that layer.
Sparse in Both Dimensions
Combine both: each token selects which layers to pass through (MoD), and within active layers selects which expert (MoE). The MoD paper explores this combination under the name Mixture-of-Depths-and-Experts (MoDE).
Same Quality, Less Compute
MoD models match isoFLOP-optimal dense baselines in quality while using significantly fewer FLOPs per forward pass. At 12.5% capacity, a 12-layer model that routes every layer spends 12 × 0.125 = 1.5 layers of compute per token on average.
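Layer-level routing can be sketched as a top-k selection over the tokens in a sequence. This is a simplified illustration, not the paper's implementation: hidden states are single floats, `mod_layer` and `block` are hypothetical names, and the selected tokens' updates are not scaled by the router score as a trainable version would require.

```python
def mod_layer(tokens, scores, capacity, block):
    """Apply `block` only to the highest-scoring fraction of tokens.

    tokens:   toy hidden states, one float per token.
    scores:   router score per token for this layer.
    capacity: fraction of tokens allowed through (e.g. 0.125).
    """
    k = max(1, int(len(tokens) * capacity))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])
    # Chosen tokens get residual + block output; the rest pass through untouched.
    out = [t + block(t) if i in chosen else t for i, t in enumerate(tokens)]
    return out, chosen

tokens = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
scores = [0.1, 0.9, 0.3, 0.2, 0.8, 0.05, 0.4, 0.6]
out, chosen = mod_layer(tokens, scores, capacity=0.25, block=lambda t: 0.5 * t)

# Average depth actually computed per token if all 12 layers are routed.
avg_layers = 12 * 0.125
```

With a capacity of 0.25, only the two highest-scoring tokens (indices 1 and 4) are updated; the other six cost nothing at this layer.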
MoD — Animate Token Paths Through the Layer Stack
MoD vs MoE — Compute Savings Explorer
Head-to-Head: Dense vs MoE vs MoD vs MoE+MoD
Full Comparison Table
| Dimension | Dense | Sparse MoE | Mixture of Depths | MoE + MoD |
|---|---|---|---|---|
| Sparse dimension | None | Width (experts) | Depth (layers) | Width + Depth |
| Total parameters | Baseline | ≈N× FFN params (attention and embeddings shared) | Same as dense | ≈N× FFN params |
| Active params/token | 100% | k/N of FFN params | C% of all layers | k/N × C% |
| FLOPs/token vs dense of equal total size | 1× | ~25% of FFN FLOPs (top-2 of 8) | ~50% (50% capacity) | ~12–15% |
| Memory requirement | Baseline | ≈N× FFN weights (all experts loaded) | Same as dense | ≈N× FFN weights (all experts) |
| Routing overhead | None | 1 router per layer | 1 router per layer | 2 routers per layer |
| Communication | Low | All-to-All per layer | Low (no expert dispatch) | All-to-All per active layer |
| Training stability | High | Moderate | High | Moderate |
| Load balancing needed | No | Yes (aux loss) | Inherent (capacity) | Yes (both routers) |
| Expert specialization | None | Yes — emergent | Layer importance learned | Both |
| Key model | GPT-2, LLaMA | Mixtral, Switch | MoD (DeepMind 2024) | MoDE (Raposo et al., 2024) |
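The FLOPs row of the table can be approximated with a first-order estimate: width sparsity contributes a factor of k/N and depth sparsity a factor equal to the capacity. The helper below is a rough sketch under that assumption; it ignores router cost and the fact that MoE leaves attention FLOPs unchanged, so real figures will be somewhat higher.

```python
def flops_fraction(k=None, n_experts=None, capacity=1.0):
    """First-order fraction of dense per-token FLOPs.

    k, n_experts: top-k of N experts (omit both for no width sparsity).
    capacity:     fraction of tokens processed per layer (1.0 = no depth sparsity).
    """
    width = k / n_experts if k is not None else 1.0
    return width * capacity

estimates = {
    "Dense": flops_fraction(),
    "Sparse MoE": flops_fraction(k=2, n_experts=8),
    "Mixture of Depths": flops_fraction(capacity=0.5),
    "MoE + MoD": flops_fraction(k=2, n_experts=8, capacity=0.5),
}
```

The combined estimate of 12.5% sits at the low end of the table's 12–15% range, which is consistent with the overheads this sketch leaves out.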
Quality vs Compute — Three Scaling Curves