Mixture of Depths (MoD) vs MoE — Interactive Visualizer

What is Sparse Activation?

Dense FFN

Traditional Transformer

Every token passes through one large FFN. All parameters active on all tokens. Compute ∝ parameters.

Sparse MoE

Mixture of Experts

FFN split into N Expert sub-networks. Each token activates only top-k experts. Parameters ↑, FLOPs/token stays constant.

Router

Gating Network

Small linear layer: token embedding → N logits → softmax → top-k selection. Learned end-to-end with the model.
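
The gating step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the names `route_top_k` and `router_weights`, the toy weight values, and the choice to renormalize the selected gates so they sum to 1 are all assumptions made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(token, router_weights, k):
    """Return [(expert_index, gate_weight), ...] for the k highest-scoring experts.

    token: list of d floats (the token embedding).
    router_weights: N rows of d floats, one row per expert (hypothetical layout).
    """
    # Linear layer: one logit per expert.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize so the k selected gates sum to 1 (one common convention).
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

token = [1.0, 0.0, -1.0]
router_weights = [
    [0.5, 0.0, 0.0],   # expert 0 -> logit 0.5
    [2.0, 0.0, 0.0],   # expert 1 -> logit 2.0
    [0.0, 0.0, 1.0],   # expert 2 -> logit -1.0
    [1.0, 0.0, -0.5],  # expert 3 -> logit 1.5
]
selected = route_top_k(token, router_weights, k=2)
print(selected)  # experts 1 and 3, with gates that sum to 1
```

Renormalizing the top-k softmax probabilities is mathematically the same as taking a softmax over only the top-k logits, so either ordering gives the same gates.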

Result

Mixtral 8×7B

46.7B total params, 12.9B active per token. Matches or beats Llama-2 70B on most benchmarks with about 5× fewer active parameters per token.

Animate a Token Through the MoE Layer

Experts (N): 8

Top-k: 2

Click "Animate Token" to watch a token route through the MoE layer.

Where Does MoE Sit in the Transformer?

MoE replaces the FFN sublayer inside each transformer block. Every other component — token embeddings, multi-head attention, layer norm, residual connections — stays identical to a dense transformer. Toggle between architectures and animate a token flowing through the full stack.

Layers: 6

Experts / layer: 8

Architecture

Select an architecture and click Animate Token to watch a token flow through all layers.

Layer-by-Layer Breakdown

Multi-Head Attention

Identical in dense and MoE. Each token attends to all others in the sequence. Cost = O(seq²·d). Not changed by MoE.

MoE FFN Sublayer

Replaces the single FFN. Router selects top-k of N experts. Only k expert FFNs execute — rest are skipped. Residual + LayerNorm wrap it exactly as before.
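
As a rough sketch of that sublayer, the snippet below combines the outputs of only the selected experts and then adds the residual. The toy experts (each just scales its input), the `calls` log, and the hard-coded `gates` are hypothetical stand-ins for real FFNs and a real router; LayerNorm is omitted to keep the example short.

```python
def moe_ffn(x, gates, experts):
    """Weighted sum of the selected experts' outputs.

    gates: [(expert_index, weight), ...] as produced by a top-k router.
    experts: list of callables, each mapping a vector to a vector.
    """
    out = [0.0] * len(x)
    for idx, weight in gates:
        y = experts[idx](x)  # only the selected experts execute
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

calls = []  # records which experts actually ran

def make_expert(scale, name):
    def expert(x):
        calls.append(name)
        return [scale * xi for xi in x]
    return expert

# Eight toy experts; expert i multiplies its input by i.
experts = [make_expert(float(i), i) for i in range(8)]

x = [1.0, 2.0]
gates = [(1, 0.75), (3, 0.25)]  # top-2 of 8, weights sum to 1
y = moe_ffn(x, gates, experts)
block_out = [xi + yi for xi, yi in zip(x, y)]  # residual connection

print(calls)      # [1, 3]: six of the eight experts were skipped
print(block_out)  # [2.5, 5.0]
```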

Residual Stream

The residual connection bypasses each sublayer. In MoE, dropped tokens (over capacity) ride the residual unchanged — their representation is not updated by that expert layer.

Mixed Architecture

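Many real models alternate: some layers use a dense FFN (for stability at early/late layers), others use MoE. GShard and Switch Transformer place MoE in every other FFN layer; Mixtral uses MoE at every FFN layer. GPT-4's architecture has not been published, so reports about its layer layout are unconfirmed.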

Load Balancing — The Core Training Challenge

Experts (N): 8

Tokens/batch: 64

Capacity factor C: 1.25

Collapse bias: 5

Mode

Select a mode and click Re-sample.

Why Balance Matters

Expert Collapse

Without regularization the router picks 1–2 favorites. Others receive ~0 tokens — wasted parameters, no gradient, stuck training.

Auxiliary Loss

L_aux = α·N·Σ fᵢ·Pᵢ

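fᵢ = fraction of tokens routed to expert i (pre-capacity); Pᵢ = mean router probability assigned to expert i over the batch; α = loss coefficient. fᵢ is a hard count, so the gradient flows through Pᵢ, which is what makes the imbalance penalty differentiable.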
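
To make the formula concrete, here is a small sketch that evaluates L_aux for a balanced and a collapsed router. Top-1 routing, the 4-expert toy batch, and the default `alpha=0.01` are assumptions chosen for illustration; Pᵢ is taken to be the mean router probability for expert i.

```python
def aux_loss(assignments, router_probs, num_experts, alpha=0.01):
    """L_aux = alpha * N * sum_i f_i * P_i.

    assignments: chosen expert index per token (top-1 for simplicity).
    router_probs: per-token list of N softmax probabilities.
    """
    tokens = len(assignments)
    # f_i: fraction of tokens dispatched to expert i (a hard count).
    f = [assignments.count(i) / tokens for i in range(num_experts)]
    # P_i: mean router probability for expert i (carries the gradient).
    P = [sum(p[i] for p in router_probs) / tokens for i in range(num_experts)]
    return alpha * num_experts * sum(fi * Pi for fi, Pi in zip(f, P))

N = 4
uniform = aux_loss([0, 1, 2, 3], [[0.25] * N] * 4, N)
collapsed = aux_loss([0, 0, 0, 0], [[0.97, 0.01, 0.01, 0.01]] * 4, N)
print(uniform, collapsed)  # about 0.01 vs about 0.0388
```

With perfectly uniform routing the sum equals 1/N, so the loss bottoms out at α; any concentration of tokens on a few experts pushes it higher.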

Capacity & Dropping

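capacity = ⌊C·T/N⌋ tokens per expert, where T is tokens per batch, N is the number of experts, and C is the capacity factor. Tokens over capacity are dropped and take the residual bypass. Red hatching shows overflow in the chart above.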
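
A minimal sketch of capacity-limited dispatch follows, using the visualizer's defaults (64 tokens, 8 experts, C = 1.25). The function name, the first-come-first-served drop policy, rounding down after applying C, and the skewed toy assignment are assumptions for illustration.

```python
import math

def dispatch_with_capacity(assignments, num_experts, capacity_factor):
    """Keep tokens until each expert's buffer is full; drop the rest.

    assignments: chosen expert index per token, in sequence order.
    Returns (capacity, kept_token_indices, dropped_token_indices).
    """
    capacity = math.floor(len(assignments) / num_experts * capacity_factor)
    load = [0] * num_experts
    kept, dropped = [], []
    for t, e in enumerate(assignments):
        if load[e] < capacity:
            load[e] += 1
            kept.append(t)
        else:
            dropped.append(t)  # this token rides the residual unchanged
    return capacity, kept, dropped

# Skewed routing: 20 tokens pick expert 0, the other 44 spread over experts 1..7.
assignments = [0] * 20 + [1 + (i % 7) for i in range(44)]
capacity, kept, dropped = dispatch_with_capacity(assignments, 8, 1.25)
print(capacity, len(kept), len(dropped))  # 10 54 10
```

Only expert 0 overflows here: it accepts its first 10 tokens and the next 10 are dropped, while every other expert stays under capacity.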

Real-World

Mixtral: top-2, no dropping. Switch: top-1, C=1.25. DeepSeek-V3: aux-loss-free with bias correction, 256 experts.

Hardware Implications — Why MoE is Hard to Deploy

Readouts: Total Params, Active/Token, Activation %, Memory (bf16), FLOPs/Token

Experts (N): 8

Top-k: 2

FFN dim: 4096

All-to-All Communication & Memory Tradeoff

Expert Parallelism

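Each GPU holds N/D experts, where D is the number of devices. In every MoE layer, tokens are dispatched to the device holding their expert via an All-to-All collective, and the results are returned by a second one. Each collective is a hard sync barrier.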

Comm. Bottleneck

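O(B×d_model) bytes per device per All-to-All, where B is the number of tokens on that device. The transfer is hard to overlap with compute, because experts cannot start until their tokens arrive. Sharding N experts across more devices sends a larger share of tokens off-device, which means more All-to-All traffic.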
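
The dispatch pattern can be illustrated by counting how many tokens each device would send to each other device. This is a toy sketch under stated assumptions: experts are placed contiguously (expert e lives on device e // (N/D)), routing is top-1, and the token-to-expert choices and `d_model = 4096` are made up for the example.

```python
def all_to_all_counts(token_device, token_expert, num_experts, num_devices):
    """counts[src][dst] = number of tokens device src must send to device dst."""
    per_device = num_experts // num_devices  # N/D experts per device
    counts = [[0] * num_devices for _ in range(num_devices)]
    for src, expert in zip(token_device, token_expert):
        dst = expert // per_device  # device that holds this expert
        counts[src][dst] += 1
    return counts

# 8 tokens spread over 4 devices, 8 experts (2 per device).
token_device = [0, 0, 1, 1, 2, 2, 3, 3]
token_expert = [0, 5, 2, 3, 7, 4, 1, 6]
counts = all_to_all_counts(token_device, token_expert, num_experts=8, num_devices=4)

off_device = sum(counts[s][d] for s in range(4) for d in range(4) if s != d)
d_model = 4096
bytes_sent = off_device * d_model * 2  # bf16 activations, dispatch direction only

print(counts)      # [[1, 0, 1, 0], [0, 2, 0, 0], [0, 0, 1, 1], [1, 0, 0, 1]]
print(off_device)  # 3 tokens must cross devices
print(bytes_sent)  # 24576
```

The same volume travels back in the combine step, so a real layer would move roughly twice this amount.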

Memory vs Compute

MoE 8×7B: ~94GB memory, ~26 GFLOPs/tok. Dense 70B: ~140GB, ~140 GFLOPs/tok. Pay in memory, save on compute.
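
These figures follow from two rules of thumb, sketched below: bf16 stores each weight in 2 bytes, and a forward pass costs roughly 2 FLOPs (one multiply, one add) per active weight per token. The helper names are hypothetical, and the estimate covers weights only, ignoring the KV cache, activations, and the attention-score term.

```python
def bf16_memory_gb(total_params):
    """Weight memory in GB at 2 bytes per parameter."""
    return total_params * 2 / 1e9

def forward_gflops_per_token(active_params):
    """Approximate forward-pass GFLOPs per token: 2 FLOPs per active weight."""
    return 2 * active_params / 1e9

mixtral_mem = bf16_memory_gb(46.7e9)                # all experts must be resident
mixtral_flops = forward_gflops_per_token(12.9e9)    # only active params compute
dense_mem = bf16_memory_gb(70e9)
dense_flops = forward_gflops_per_token(70e9)

print(round(mixtral_mem, 1), round(mixtral_flops, 1))  # 93.4 25.8
print(round(dense_mem, 1), round(dense_flops, 1))      # 140.0 140.0
```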

Token Dropping

Buffer overflow → token bypasses expert via residual. Introduces approximation error. Careful capacity tuning is critical for quality.

What is Expert Specialization?

Emergent Behavior

Nobody programs experts to specialize. It emerges from training — experts that handle certain token types well receive more gradient for those tokens and naturally develop domain focus.

Mixtral Findings

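Analysis of Mixtral 8×7B routing (Jiang et al., 2024) found no obvious domain clustering: expert assignment looked very similar across ArXiv papers, biology abstracts, and philosophy texts, with only a marginally different distribution on DM Mathematics. The patterns that did appear were syntactic, for example Python's "self" and code indentation tokens going to the same experts, and consecutive tokens often sharing an expert.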

Why It Matters

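Specialization is one proposed reason MoE can beat dense models at equal FLOPs: experts act as efficient specialists while total parameter capacity grows. It also offers a handle for interpretability, since routing can be probed to see which experts a given kind of input activates.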

Not Perfect

Soft and distributed, not hard-coded. Load balancing constrains it. Early layers specialize less. The same token type may split across multiple experts.

Routing Heatmap — Token Category × Expert Affinity

Brighter = stronger affinity. Values are approximate and illustrative, loosely based on Mixtral 8×7B layer analysis.

Click any row to highlight that token category and see which experts it prefers.

Training Dynamics — Watch Specialization Emerge

Training progress: 0%

At initialization (0%), routing is nearly uniform across all experts.

Live Token Stream

Select a token above to animate its routing path.

What is Mixture of Depths (MoD)?

MoD (Raposo et al., Google DeepMind, 2024) extends sparse activation from the width dimension to the depth dimension. Instead of asking "which expert processes this token?", MoD asks: "does this token even need to go through this layer?"

MoE — Width Sparse

Routes Across Experts

All tokens process every layer. Within each FFN, only k of N experts are activated. Saves compute inside the FFN sublayer.

MoD — Depth Sparse

Routes Across Layers

At each layer, a router decides which tokens to process. Tokens that don't "need" this layer skip it entirely via the residual — zero compute for those tokens at that layer.
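
One way to picture the per-layer decision is the sketch below: the router scores every token, the top fraction is processed, and the rest pass through untouched. It is a simplification under stated assumptions: hidden states are single floats, `block` stands in for the whole attention + FFN block and is applied per token, and the processed update is scaled by the router score so the router stays on the gradient path.

```python
def mod_layer(tokens, scores, capacity_frac, block):
    """Process only the highest-scoring tokens; the rest skip via the residual.

    tokens: list of floats (toy 1-d hidden states).
    scores: router score per token.
    capacity_frac: fraction of tokens this layer may process.
    """
    k = max(1, int(len(tokens) * capacity_frac))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])
    out = []
    for i, h in enumerate(tokens):
        if i in chosen:
            out.append(h + scores[i] * block(h))  # residual + router-weighted update
        else:
            out.append(h)  # skipped: no block compute for this token
    return out, chosen

tokens = [1.0, 2.0, 3.0, 4.0]
scores = [0.1, 0.9, 0.5, 0.2]
out, chosen = mod_layer(tokens, scores, capacity_frac=0.5, block=lambda h: 2 * h)

print(sorted(chosen))  # [1, 2]
print(out)             # tokens 0 and 3 are unchanged
```

Because the number of processed tokens is fixed by `capacity_frac`, the compute per layer is known in advance regardless of which tokens the router picks.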

MoE + MoD

Sparse in Both Dimensions

Combine both: each token selects which layers to pass through (MoD), and within active layers selects which expert (MoE). Raposo et al. (2024) explore this combination as MoDE (Mixture-of-Depths-and-Experts).

Key Result

Same Quality, Less Compute

MoD models match isoFLOP dense baselines while using significantly fewer FLOPs/token. At 12.5% capacity, a 12-layer MoD that routes at every layer uses ~1.5 layers of compute per token on average (12 × 0.125); routing at only some layers saves less.
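
The layer-equivalent arithmetic is easy to check. The sketch below assumes cost scales linearly with the fraction of tokens processed (ignoring the router and the quadratic attention term); the `routed_every` parameter is a hypothetical knob for models that interleave routed and ordinary layers.

```python
def avg_layers_of_compute(num_layers, capacity_frac, routed_every=1):
    """Average number of full layers' worth of compute each token receives.

    routed_every=1: every layer is a routed (MoD) layer.
    routed_every=2: every other layer is routed, the rest process all tokens.
    """
    routed = num_layers // routed_every
    dense = num_layers - routed
    return dense + routed * capacity_frac

all_routed = avg_layers_of_compute(12, 0.125)
alternating = avg_layers_of_compute(12, 0.125, routed_every=2)

print(all_routed)   # 1.5
print(alternating)  # 6.75
```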

MoD — Animate Token Paths Through the Layer Stack

Layers: 8

Capacity (% pass-through): 50%

Tokens to show: 6

Click Animate to watch tokens selectively skip layers based on the router's decision.

MoD vs MoE — Compute Savings Explorer

Readouts: MoD FLOPs/Token, MoE FLOPs/Token, MoD Saving vs Dense, MoE Saving vs Dense

MoD capacity %: 50%

MoE experts (N): 8

MoE top-k: 2

Adjust sliders to compare compute costs across architectures.

Head-to-Head: Dense vs MoE vs MoD vs MoE+MoD

Layers: 6

MoD capacity %: 50%

MoE experts: 8

Click Animate All to see the same token processed by all four architectures simultaneously.

Full Comparison Table

Dimension	Dense	Sparse MoE	Mixture of Depths	MoE + MoD
Sparse dimension	None	Width (experts)	Depth (layers)	Width + Depth
Total parameters	Baseline	~N× FFN params (attention is shared)	Same as dense	~N× FFN params
Active params/token	100%	k/N of FFN params	C% of all layers	k/N × C%
FLOPs/token vs dense	1×	~25% of FFN FLOPs (top-2 of 8, vs equal-parameter dense)	~50% (50% capacity)	~12–15%
Memory requirement	Baseline	~N× FFN weights (all experts loaded)	Same as dense	~N× FFN weights (all experts)
Routing overhead	None	1 router per layer	1 router per layer	2 routers per layer
Communication	Low	All-to-All per layer	Low (no expert dispatch)	All-to-All per active layer
Training stability	High	Moderate	High	Moderate
Load balancing needed	No	Yes (aux loss)	Inherent (capacity)	Yes (both routers)
Expert specialization	None	Yes — emergent	Layer importance learned	Both
Key model	GPT-2, LLaMA	Mixtral, Switch	MoD (DeepMind 2024)	MoDE (Raposo et al., 2024)
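
The "FLOPs/token vs dense" row can be reproduced with a back-of-the-envelope model in which width sparsity contributes a factor k/N and depth sparsity contributes the capacity fraction. This is a rough sketch only: the function name is hypothetical, the baseline is a dense model with the same total FFN parameters, and router cost and attention are ignored.

```python
def ffn_flops_fraction(arch, k=2, n=8, capacity=0.5):
    """Rough FFN FLOPs per token relative to an equal-parameter dense model."""
    width = k / n if arch in ("moe", "moe+mod") else 1.0      # experts activated
    depth = capacity if arch in ("mod", "moe+mod") else 1.0   # layers entered
    return width * depth

fractions = {arch: ffn_flops_fraction(arch) for arch in ("dense", "moe", "mod", "moe+mod")}
print(fractions)  # {'dense': 1.0, 'moe': 0.25, 'mod': 0.5, 'moe+mod': 0.125}
```

The combined figure of 12.5% is the idealized product; real overheads such as routing push it somewhat higher.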

Quality vs Compute — Three Scaling Curves

Compute budget (TFLOPs/token): 20

Legend: Dense, Sparse MoE, MoE + MoD, Your budget

Drag to explore quality at any compute budget.