DLRM: Memory Bandwidth & Compute Visualizer

"Sparse embeddings starve the bandwidth, dense MLPs starve the compute."

Step 1 / 6
Balanced 6 embedding tables (~12 GB) · Bottom MLP [13→64→32→16] · Top MLP [31→128→64→1]
1. Inputs
2. Embedding Lookups
3. Bottom MLP
4. Feature Interaction
5. Top MLP
6. Click Pred.
Input Sample (User Click Event)
Sparse / Categorical Features
Dense / Continuous Features
Embedding Tables (HBM)
6 tables · Total: ~12 GB
Interaction
Dot product: \(e_i \cdot e_j\)
Pairwise: \(O(n^2)\)
MLP Towers
Bottom MLP (4 layers)
Top MLP (4 layers)
Click Probability
σ(z)
GPU Hardware Pressure (this step)
HBM Bandwidth 0%
Peak: 2 TB/s · Used: 0 GB/s
Tensor Core / FLOPs 0%
Peak: 312 TFLOP/s · Used: 0 TFLOP/s
Bottleneck: idle — press Next to begin

What's happening?

Press Next to walk through one DLRM forward pass. Try the three mode buttons above: each rebuilds the model architecture so you can compare balanced, embedding-heavy, and MLP-heavy DLRMs visually.

DLRM Forward Pass

\(e_i = E_i[\text{idx}_i]\) (lookup)
\(d = \text{MLP}_\text{bot}(x_\text{dense})\)
\(z_{ij} = e_i \cdot e_j\) (dot)
\(\hat{y} = \sigma(\text{MLP}_\text{top}([d; z]))\)
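
The four equations above can be sketched in plain Python using the "Balanced" layer sizes shown on this page. This is a minimal illustration, not the visualizer's own code. The embedding width (16), the toy table size, the random weights, and the helper names `make_linear`, `build`, `mlp`, and `forward` are assumptions made for the example.

```python
import math
import random

random.seed(0)

NUM_TABLES = 6         # from the "Balanced" mode on this page
EMB_DIM = 16           # assumption: same width as the bottom-MLP output
ROWS_PER_TABLE = 1000  # assumption: toy size, far smaller than the ~12 GB shown


def make_linear(n_in, n_out):
    """One fully connected layer: random weights, zero bias."""
    scale = 1.0 / math.sqrt(n_in)
    w = [[random.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


def build(sizes):
    """Build the layers for a size list such as [13, 64, 32, 16]."""
    return [make_linear(a, b) for a, b in zip(sizes, sizes[1:])]


def mlp(x, layers, final_relu):
    """Apply each layer; ReLU between layers and optionally after the last."""
    for k, (w, b) in enumerate(layers):
        x = [sum(wi * xi for wi, xi in zip(row, x)) + bk for row, bk in zip(w, b)]
        if k < len(layers) - 1 or final_relu:
            x = [max(0.0, v) for v in x]
    return x


bottom = build([13, 64, 32, 16])
top = build([31, 128, 64, 1])
tables = [
    [[random.gauss(0.0, 0.1) for _ in range(EMB_DIM)] for _ in range(ROWS_PER_TABLE)]
    for _ in range(NUM_TABLES)
]


def forward(x_dense, idx):
    e = [tables[i][idx[i]] for i in range(NUM_TABLES)]  # e_i = E_i[idx_i]
    d = mlp(x_dense, bottom, final_relu=True)           # d = MLP_bot(x_dense)
    z = [
        sum(a * b for a, b in zip(e[i], e[j]))          # z_ij = e_i . e_j
        for i in range(NUM_TABLES)
        for j in range(i + 1, NUM_TABLES)
    ]
    logit = mlp(d + z, top, final_relu=False)[0]        # MLP_top([d; z])
    y_hat = 1.0 / (1.0 + math.exp(-logit))              # sigma(.)
    return y_hat, len(z)


x_dense = [random.random() for _ in range(13)]
idx = [random.randrange(ROWS_PER_TABLE) for _ in range(NUM_TABLES)]
y_hat, n_pairs = forward(x_dense, idx)
print(f"pairwise terms: {n_pairs}, click probability: {y_hat:.4f}")
```

With 6 embedding vectors there are 6·5/2 = 15 pairwise dot products. Concatenated with the 16-wide bottom-MLP output, they give the 31-wide input of the top MLP.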

Roofline Position

[Roofline plot. x-axis: Arithmetic Intensity (FLOPs/Byte). Regions: mem-bound (left), compute-bound (right).]

Embedding lookups sit far left — memory-bound.
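
As a rough sketch of how a roofline position could be computed, the snippet below uses the peak figures from the hardware panel (312 TFLOP/s, 2 TB/s). The 16-bit element size, the byte-counting model (every operand read or written exactly once), the batch size, and the function names are assumptions for illustration.

```python
PEAK_FLOPS = 312e12  # FLOP/s, peak shown in the hardware panel
PEAK_BW = 2e12       # bytes/s, peak HBM bandwidth shown in the panel
BYTES_PER_ELEM = 2   # assumption: 16-bit values

# Arithmetic intensity at which the bandwidth roof meets the compute roof.
RIDGE = PEAK_FLOPS / PEAK_BW


def attainable_flops(intensity):
    """Roofline model: the lower of the compute roof and the bandwidth roof."""
    return min(PEAK_FLOPS, intensity * PEAK_BW)


def embedding_intensity(dim, lookups=1):
    """Gather `lookups` rows and sum-pool them; a single lookup does no math."""
    flops = (lookups - 1) * dim
    bytes_moved = lookups * dim * BYTES_PER_ELEM
    return flops / bytes_moved


def dense_layer_intensity(batch, n_in, n_out):
    """One fully connected layer: weights, inputs, and outputs each moved once."""
    flops = 2 * batch * n_in * n_out
    bytes_moved = BYTES_PER_ELEM * (n_in * n_out + batch * n_in + batch * n_out)
    return flops / bytes_moved


emb_ai = embedding_intensity(dim=16)                            # one index per table
mlp_ai = dense_layer_intensity(batch=2048, n_in=128, n_out=64)  # hypothetical batch

print(f"ridge point:      {RIDGE:.1f} FLOP/B")
print(f"embedding lookup: {emb_ai:.2f} FLOP/B")
print(f"dense layer:      {mlp_ai:.1f} FLOP/B")
```

Under this model, a layer's intensity depends on batch size and layer width, so whether a given MLP layer lands left or right of the ridge point depends on those choices. The lookup stays at the far left either way.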

Step Stats

Bytes accessed
0 KB
FLOPs
0
Arith. intensity
— FLOP/B
Access pattern

Why this matters

Embedding tables in production DLRMs can reach terabyte scale. Each lookup touches only a tiny slice of a table, but the access pattern is irregular and caches poorly, so memory bandwidth, not FLOPs, bounds throughput. MLPs are the opposite: small, dense, and Tensor-Core-friendly. This is why DLRM-style workloads helped motivate specialized hardware such as Google's TPU v4/v7 and Meta's MTIA.
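
To make the "tiny slice" concrete, here is a back-of-the-envelope calculation for the ~12 GB, 6-table configuration shown on this page. The 16-wide fp32 rows, the even split across tables, and decimal gigabytes are assumptions. The throughput figure is an idealized ceiling that ignores the inefficiency of random access.

```python
TOTAL_BYTES = 12 * 10**9  # ~12 GB across all tables (decimal GB assumed)
NUM_TABLES = 6
EMB_DIM = 16              # assumption: embedding width
BYTES_PER_ELEM = 4        # assumption: fp32 rows
PEAK_BW = 2 * 10**12      # bytes/s, peak HBM bandwidth from the hardware panel

row_bytes = EMB_DIM * BYTES_PER_ELEM
rows_per_table = TOTAL_BYTES // NUM_TABLES // row_bytes  # assumes equal-size tables
bytes_per_sample = NUM_TABLES * row_bytes                # one lookup per table
fraction_touched = bytes_per_sample / TOTAL_BYTES

# Idealized ceiling: assumes every byte moved over HBM is a useful byte.
max_samples_per_sec = PEAK_BW / bytes_per_sample

print(f"rows per table:     {rows_per_table:,}")
print(f"bytes per sample:   {bytes_per_sample}")
print(f"fraction of tables: {fraction_touched:.2e}")
print(f"ideal samples/s:    {max_samples_per_sec:.2e}")
```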

Reference: Naumov et al., Deep Learning Recommendation Model for Personalization and Recommendation Systems, Meta AI, arXiv:1906.00091. Course: EE508 Week 9.