EE508 · Systems for ML · 2024 Architecture Deep-Dive

Mixture of Depths vs Mixture of Experts

Mixture of Depths (MoD) routes tokens across transformer layers; Mixture of Experts (MoE) routes tokens across FFN experts. Combined, they can cut FLOPs per token by up to about 88% versus a dense model of the same total size (top-2 of 8 experts at 50% MoD capacity). MoE is the sparse design behind models such as DeepSeek-V3; MoD adds the depth dimension. Interact with both, compare them head-to-head, and see why MoD is the idea MoE was missing.

What is Sparse Activation?

Dense FFN

Traditional Transformer

Every token passes through one large FFN. All parameters active on all tokens. Compute ∝ parameters.

Sparse MoE

Mixture of Experts

The FFN is split into N expert sub-networks, and each token activates only its top-k experts. Total parameters grow with N, while FLOPs/token depend only on k and so stay roughly constant.

Router

Gating Network

Small linear layer: token embedding → N logits → softmax → top-k selection. Learned end-to-end with the model.
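
The gating step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names, the toy weights, and the choice to renormalize the top-k probabilities so they sum to 1 are all assumptions made here.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(token, router_weights, k):
    """Return [(expert_index, gate_weight), ...] for one token.

    router_weights is an N x d matrix with one row per expert (hypothetical
    layout). Gate weights are renormalized over the selected experts, which
    is one common choice rather than the only one.
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# Toy example: 4 experts, 3-dimensional token embedding.
token = [1.0, 0.0, -1.0]
router_weights = [
    [0.5, 0.1, 0.0],
    [2.0, 0.0, -1.0],
    [-1.0, 0.3, 0.2],
    [1.0, 0.0, 0.0],
]
selected = route_top_k(token, router_weights, k=2)
```

In a trained model the router weights are learned jointly with the experts; here they are fixed by hand so the routing decision is easy to follow.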

Result

Mixtral 8×7B

46.7B total params, 12.9B active per token. Matches or beats Llama-2 70B on most benchmarks with roughly 5× fewer active parameters per token (12.9B vs 70B).

Animate a Token Through the MoE Layer

Interactive demo (defaults: N = 8 experts, top-k = 2): animates one token routing through the MoE layer.

Where Does MoE Sit in the Transformer?

MoE replaces the FFN sublayer inside each transformer block. Every other component — token embeddings, multi-head attention, layer norm, residual connections — stays identical to a dense transformer. Toggle between architectures and animate a token flowing through the full stack.

Interactive demo (defaults: 6 layers, 8 experts per layer): select an architecture, then animate a token flowing through all layers.

Layer-by-Layer Breakdown

Multi-Head Attention

Identical in dense and MoE. Each token attends to all others in the sequence. Cost = O(seq²·d). Not changed by MoE.

MoE FFN Sublayer

Replaces the single FFN. Router selects top-k of N experts. Only k expert FFNs execute — rest are skipped. Residual + LayerNorm wrap it exactly as before.
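
A rough sketch of that sublayer, assuming the router has already produced (expert index, gate weight) pairs: the output is the residual input plus the gate-weighted sum of the selected experts' outputs. LayerNorm is omitted for brevity, and the toy experts and the `calls` log are hypothetical helpers used only to show that unselected experts never run.

```python
def moe_sublayer(x, experts, gates):
    """x: list of floats; experts: list of callables; gates: [(index, weight)]."""
    out = list(x)  # residual path: start from a copy of the input
    for idx, weight in gates:
        y = experts[idx](x)  # only the selected experts execute
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

calls = []  # records which experts were actually evaluated

def make_expert(scale, name):
    # Toy "expert": scales its input. A real expert is a full FFN.
    def expert(x):
        calls.append(name)
        return [scale * v for v in x]
    return expert

experts = [make_expert(s, i) for i, s in enumerate([1.0, 2.0, 3.0, 4.0])]
x = [1.0, -2.0]
y = moe_sublayer(x, experts, gates=[(1, 0.75), (3, 0.25)])
```

After this runs, `calls` contains only experts 1 and 3: the other two expert functions were never invoked, which is where the compute saving comes from.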

Residual Stream

The residual connection bypasses each sublayer. In MoE, dropped tokens (over capacity) ride the residual unchanged — their representation is not updated by that expert layer.

Mixed Architecture

Many real models alternate: some layers use a dense FFN (for stability at early/late layers), others use MoE. Mixtral uses MoE at every FFN layer. GPT-4 is rumored to be an MoE model, but its architecture has not been published.

Load Balancing — The Core Training Challenge

Interactive demo (defaults: N = 8 experts, 64 tokens per batch, capacity factor C = 1.25, collapse bias 5): select a routing mode and re-sample to see the per-expert token counts.

Why Balance Matters

Expert Collapse

Without regularization the router picks 1–2 favorites. Others receive ~0 tokens — wasted parameters, no gradient, stuck training.

Auxiliary Loss

L_aux = α·N·Σ fᵢ·Pᵢ

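fᵢ = fraction of tokens routed to expert i (pre-capacity); Pᵢ = mean router probability assigned to expert i over the batch. The loss is smallest when routing is uniform, and it is differentiable through Pᵢ (fᵢ comes from a hard top-k choice and carries no gradient).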
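
To make the formula concrete, the sketch below evaluates L_aux for a perfectly balanced batch and for a fully collapsed one. It assumes top-1 routing, takes Pᵢ to be the mean router probability for expert i over the batch, and uses α = 0.01 purely as an illustrative value.

```python
def aux_loss(assignments, router_probs, num_experts, alpha=0.01):
    """L_aux = alpha * N * sum_i f_i * P_i.

    assignments: expert index chosen for each token (top-1, assumed here).
    router_probs: per-token list of softmax probabilities over experts.
    """
    T = len(assignments)
    # f_i: fraction of tokens dispatched to expert i.
    f = [assignments.count(i) / T for i in range(num_experts)]
    # P_i: mean router probability mass given to expert i.
    P = [sum(p[i] for p in router_probs) / T for i in range(num_experts)]
    return alpha * num_experts * sum(fi * Pi for fi, Pi in zip(f, P))

N = 4
# Balanced: each token goes to a different expert, router is uniform.
balanced = aux_loss([0, 1, 2, 3], [[0.25] * N for _ in range(4)], N)
# Collapsed: every token goes to expert 0 with probability 1.
collapsed = aux_loss([0, 0, 0, 0], [[1.0, 0.0, 0.0, 0.0] for _ in range(4)], N)
```

The balanced case evaluates to α and the collapsed case to α·N, so the penalty grows by a factor of N as routing concentrates on a single expert.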

Capacity & Dropping

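capacity = ⌊(T/N)×C⌋, where T is tokens per batch, N the number of experts, and C the capacity factor. With T = 64, N = 8, C = 1.25, each expert accepts at most 10 tokens. Tokens over capacity are dropped and take the residual bypass. Red hatching shows overflow in the chart above.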
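
A small sketch of capacity-limited dispatch under top-1 routing. Rounding the product down so that capacity is a whole number of tokens, and serving tokens in arrival order, are assumptions made here; the skewed assignment list is invented for illustration.

```python
import math

def expert_capacity(tokens_per_batch, num_experts, capacity_factor):
    # Maximum number of tokens a single expert will accept in this batch.
    return math.floor(tokens_per_batch / num_experts * capacity_factor)

def dispatch(assignments, num_experts, capacity):
    """Return (kept, dropped); kept[i] lists token ids processed by expert i."""
    kept = [[] for _ in range(num_experts)]
    dropped = []
    for token_id, expert in enumerate(assignments):
        if len(kept[expert]) < capacity:
            kept[expert].append(token_id)
        else:
            dropped.append(token_id)  # over capacity: rides the residual unchanged
    return kept, dropped

capacity = expert_capacity(64, 8, 1.25)
# Skewed routing: 15 tokens pick expert 0, the other 49 spread over experts 1..7.
assignments = [0] * 15 + [1 + (t % 7) for t in range(49)]
kept, dropped = dispatch(assignments, 8, capacity)
```

With these numbers expert 0 fills its 10 slots and the last 5 tokens that chose it are dropped, while every other expert stays under capacity.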

Real-World

Mixtral: top-2, no dropping. Switch: top-1, C=1.25. DeepSeek-V3: aux-loss-free with bias correction, 256 experts.
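
The bias-correction idea can be sketched as follows. This is a simplified toy in the spirit of aux-loss-free balancing, not DeepSeek-V3's actual implementation: a per-expert bias is added to the router scores for selection only, then nudged down for overloaded experts and up for underloaded ones. The step size `gamma`, the fixed scores, and the one-token batches are assumptions.

```python
def select_top_k(scores, bias, k):
    """Pick k experts by biased score; the bias affects selection only."""
    biased = [s + b for s, b in zip(scores, bias)]
    return sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:k]

def update_bias(bias, load, gamma):
    """Nudge overloaded experts down and underloaded experts up by gamma."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, l in zip(bias, load):
        if l > mean_load:
            new_bias.append(b - gamma)
        elif l < mean_load:
            new_bias.append(b + gamma)
        else:
            new_bias.append(b)
    return new_bias

scores = [0.4, 0.3, 0.2, 0.1]  # toy router scores: every token prefers expert 0
bias = [0.0] * 4
picks = []
for step in range(40):
    chosen = select_top_k(scores, bias, k=1)
    load = [chosen.count(i) for i in range(4)]
    bias = update_bias(bias, load, gamma=0.05)
    picks.append(chosen[0])
```

Without the bias, expert 0 would win every step. With it, the selection rotates across all four experts over the 40 steps, and no auxiliary loss term enters the gradient.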

Hardware Implications — Why MoE is Hard to Deploy

Interactive calculator (defaults: N = 8 experts, top-k = 2, FFN dim = 4096): reports total parameters, active parameters per token, activation %, memory in bf16, and FLOPs per token.

All-to-All Communication & Memory Tradeoff

Expert Parallelism

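With D GPUs, each holds N/D experts. Tokens are dispatched to the device that owns their expert via an All-to-All collective in every MoE layer (and a second All-to-All returns the outputs), which acts as a hard synchronization barrier.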
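
To see what the dispatch involves, the sketch below counts how many tokens each device would have to send to every other device. It assumes top-1 routing, N divisible by the number of devices, and experts laid out contiguously across devices; the function names and the routing lists are hypothetical.

```python
def device_of(expert, num_experts, num_devices):
    # Contiguous layout: experts 0..N/D-1 live on device 0, and so on.
    experts_per_device = num_experts // num_devices
    return expert // experts_per_device

def all_to_all_send_counts(assignments_per_device, num_experts, num_devices):
    """send[src][dst] = number of tokens device src must send to device dst."""
    send = [[0] * num_devices for _ in range(num_devices)]
    for src, assignments in enumerate(assignments_per_device):
        for expert in assignments:
            send[src][device_of(expert, num_experts, num_devices)] += 1
    return send

# 8 experts on 4 devices (2 experts each); each device holds 4 local tokens.
assignments_per_device = [
    [0, 1, 2, 7],  # expert chosen by each token on device 0
    [3, 3, 4, 5],  # device 1
    [0, 6, 6, 7],  # device 2
    [1, 2, 4, 6],  # device 3
]
send = all_to_all_send_counts(assignments_per_device, 8, 4)
off_device = sum(send[s][d] for s in range(4) for d in range(4) if s != d)
```

Entries off the diagonal of `send` are tokens that must cross the interconnect. Multiplying each count by d_model and the bytes per element gives a rough traffic estimate per MoE layer.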

Comm. Bottleneck

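Each device sends O(B×d_model) bytes per MoE layer, where B is its local token count. Without explicit pipelining, this transfer does not overlap with compute. Spreading more experts over more devices means more All-to-All peers and more cross-device traffic.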

Memory vs Compute

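MoE 8×7B: ~94GB memory, ~26 GFLOPs/token. Dense 70B: ~140GB, ~140 GFLOPs/token (bf16 weights, about 2 FLOPs per active parameter). Relative to a dense ~13B model with the same compute (~26GB), MoE pays in memory to save on compute.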
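
These figures follow from two rules of thumb, sketched here as assumptions: bf16 stores 2 bytes per parameter, and a forward pass costs roughly 2 FLOPs per active parameter per token. The parameter counts are the Mixtral numbers quoted earlier on this page.

```python
def memory_gb(total_params, bytes_per_param=2):
    # Weight memory only; activations and KV cache are ignored.
    return total_params * bytes_per_param / 1e9

def flops_per_token(active_params):
    # Rule of thumb: one multiply and one add per active parameter.
    return 2 * active_params

moe_mem = memory_gb(46.7e9)            # all experts must be resident
dense_mem = memory_gb(70e9)
moe_flops = flops_per_token(12.9e9)    # only the active parameters count
dense_flops = flops_per_token(70e9)
compute_ratio = dense_flops / moe_flops
```

This gives about 93 GB and 26 billion FLOPs per token for the MoE model against 140 GB and 140 billion FLOPs per token for the dense one, a compute ratio of roughly 5.4.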

Token Dropping

Buffer overflow → token bypasses expert via residual. Introduces approximation error. Careful capacity tuning is critical for quality.

What is Expert Specialization?

Emergent Behavior

Nobody programs experts to specialize. It emerges from training — experts that handle certain token types well receive more gradient for those tokens and naturally develop domain focus.

Mixtral Findings

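Routing analysis of Mixtral 8×7B (Jiang et al., 2024) found weaker specialization than often assumed: expert assignment looked similar across topics such as ArXiv papers, biology, and philosophy, with only a marginal shift for DM Mathematics. Routing tracked syntax more than domain, and consecutive tokens were often sent to the same expert.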

Why It Matters

Specialization is one common explanation for why MoE can beat dense at equal FLOPs: experts become efficient specialists. It also aids interpretability, since you can probe which expert handles a given kind of token.

Not Perfect

Soft and distributed, not hard-coded. Load balancing constrains it. Early layers specialize less. The same token type may split across multiple experts.

Routing Heatmap — Token Category × Expert Affinity

Interactive heatmap: brighter cells mean stronger affinity between a token category (row) and an expert (column). Affinities are illustrative, loosely based on Mixtral 8×7B layer analysis.

Training Dynamics — Watch Specialization Emerge

Interactive slider (training progress, default 0%): at initialization, routing is nearly uniform across all experts.


What is Mixture of Depths (MoD)?

MoD (Raposo et al., Google DeepMind, 2024) extends sparse activation from the width dimension to the depth dimension. Instead of asking "which expert processes this token?", MoD asks: "does this token even need to go through this layer?"

MoE — Width Sparse

Routes Across Experts

All tokens process every layer. Within each FFN, only k of N experts are activated. Saves compute inside the FFN sublayer.

MoD — Depth Sparse

Routes Across Layers

At each layer, a router decides which tokens to process. Tokens that don't "need" this layer skip it entirely via the residual — zero compute for those tokens at that layer.
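
A minimal sketch of one MoD layer, assuming a scalar router score per token and top-k selection over the sequence. Scaling the block output by the router score (so the router would receive gradient in a real model) is an assumption made here, and `toy_block` is a hypothetical stand-in for a full attention + FFN block.

```python
def mod_layer(tokens, router_scores, capacity_fraction, block):
    """Process only the highest-scoring tokens; the rest skip via the residual."""
    num_selected = max(1, int(len(tokens) * capacity_fraction))
    ranked = sorted(range(len(tokens)), key=lambda i: router_scores[i], reverse=True)
    selected = set(ranked[:num_selected])
    out = []
    for i, x in enumerate(tokens):
        if i in selected:
            update = block(x)  # compute is spent only on selected tokens
            out.append([xi + router_scores[i] * ui for xi, ui in zip(x, update)])
        else:
            out.append(list(x))  # skipped: representation passes through unchanged
    return out, sorted(selected)

block_calls = []  # records which tokens the block actually processed

def toy_block(x):
    block_calls.append(tuple(x))
    return [2.0 * v for v in x]

tokens = [[1.0], [2.0], [3.0], [4.0]]
router_scores = [0.5, -1.0, 1.0, 0.0]
out, selected = mod_layer(tokens, router_scores, 0.5, toy_block)
```

At 50% capacity, two of the four tokens are processed and the other two come out identical to their inputs. Because the number of selected tokens is fixed in advance, the compute per layer is known ahead of time.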

MoE + MoD

Sparse in Both Dimensions

Combine both: each token selects which layers to pass through (MoD), and within active layers selects which experts to use (MoE). DeepSeek-V3 implements the MoE half of this design; it does not route across depth.

Key Result

Same Quality, Less Compute

MoD models match isoFLOP dense baselines while using significantly fewer FLOPs/token. At 12.5% capacity with every layer routed, a 12-layer MoD spends 12 × 0.125 ≈ 1.5 layers of compute per token on average.
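
The layer count above is a simple expectation, sketched below. The `routed_every` parameter is a hypothetical knob for configurations in which only every n-th layer is routed and the rest stay dense; per-layer cost is treated as uniform, which ignores router overhead and the nonlinear savings in attention.

```python
def expected_active_layers(num_layers, capacity_fraction, routed_every=1):
    """Average number of layers' worth of compute spent per token."""
    routed = num_layers // routed_every   # layers with a MoD router
    dense = num_layers - routed           # layers every token passes through
    return dense + routed * capacity_fraction

all_routed = expected_active_layers(12, 0.125)            # every layer routed
alternating = expected_active_layers(12, 0.125, routed_every=2)
```

With every layer routed the result is 1.5 layers; routing only every other layer at the same capacity gives 6.75.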

MoD — Animate Token Paths Through the Layer Stack

Interactive demo (defaults: 8 layers, 50% capacity, 6 tokens shown): animates tokens skipping layers according to the router's decision.

MoD vs MoE — Compute Savings Explorer

Interactive calculator (defaults: MoD capacity 50%, MoE N = 8 experts, top-k = 2): reports FLOPs per token and savings versus dense for MoD and for MoE.

Head-to-Head: Dense vs MoE vs MoD vs MoE+MoD

Interactive demo (defaults: 6 layers, MoD capacity 50%, 8 MoE experts): animates the same token through all four architectures side by side.

Full Comparison Table

Dimension | Dense | Sparse MoE | Mixture of Depths | MoE + MoD
Sparse dimension | None | Width (experts) | Depth (layers) | Width + Depth
Total parameters | Baseline | N× more (one per expert) | Same as dense | N× more
Active params/token | 100% | k/N of FFN params | C% of all layers | k/N × C%
FLOPs/token vs dense | 100% | ~25% (top-2 of 8) | ~50% (50% capacity) | ~12–15%
Memory requirement | Baseline | N× (all experts loaded) | Same as dense | N× (all experts)
Routing overhead | None | 1 router per layer | 1 router per layer | 2 routers per layer
Communication | Low | All-to-All per layer | Low (no expert dispatch) | All-to-All per active layer
Training stability | High | Moderate | High | Moderate
Load balancing needed | No | Yes (aux loss) | Inherent (capacity) | Yes (both routers)
Expert specialization | None | Yes (emergent) | Layer importance learned | Both
Key model | GPT-2, LLaMA | Mixtral, Switch | MoD (DeepMind 2024) | DeepSeek-V3 (approx)
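
The FLOPs row can be reproduced with a back-of-the-envelope ratio. This sketch assumes the dense baseline has the same total parameter count as the sparse model, and it ignores router overhead and attention cost, so the outputs are idealized lower bounds rather than measured numbers.

```python
def relative_flops(top_k=None, num_experts=None, capacity=None):
    """Fraction of dense FLOPs/token under idealized width and depth sparsity."""
    ratio = 1.0
    if top_k is not None and num_experts is not None:
        ratio *= top_k / num_experts   # MoE: only k of N experts run
    if capacity is not None:
        ratio *= capacity              # MoD: only a fraction of layers run
    return ratio

dense = relative_flops()
moe = relative_flops(top_k=2, num_experts=8)
mod = relative_flops(capacity=0.5)
both = relative_flops(top_k=2, num_experts=8, capacity=0.5)
saving_both = 1.0 - both
```

The idealized combined ratio is 12.5%, a saving of 87.5%; routing overhead and unsparsified attention push real figures toward the upper end of the 12–15% range.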

Quality vs Compute — Three Scaling Curves

Interactive plot (default budget: 20 TFLOPs/token): quality versus compute for Dense, Sparse MoE, and MoE + MoD, with a marker at your chosen budget.