DLRM: Memory Bandwidth & Compute Visualizer

"Sparse embeddings starve the bandwidth, dense MLPs starve the compute."

Step 1 / 6

1. Inputs → 2. Embedding Lookups → 3. Bottom MLP → 4. Feature Interaction → 5. Top MLP → 6. Click Prediction

Input Sample (User Click Event)

Sparse / Categorical Features

Dense / Continuous Features

Embedding Tables (HBM)

6 tables · Total: ~12 GB

Interaction

Dot product: \(e_i \cdot e_j\)

Pairwise: \(O(n^2)\)
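
As a worked count, assuming the interaction covers only the six embedding vectors (one per table listed above, which matches the forward-pass equations on this page), the number of distinct pairwise dot products is:

\[
\binom{n}{2} = \frac{n(n-1)}{2}, \qquad n = 6 \;\Rightarrow\; \frac{6 \cdot 5}{2} = 15 .
\]

If the bottom-MLP output were also included as a seventh vector (an assumption, not something this page states), the count would be \(7 \cdot 6 / 2 = 21\).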

MLP Towers

Bottom MLP (4 layers)

Top MLP (4 layers)

Click Probability

—

σ(z)

GPU Hardware Pressure (this step)

HBM Bandwidth 0%

Peak: 2 TB/s · Used: 0 GB/s

Tensor Core / FLOPs 0%

Peak: 312 TFLOP/s · Used: 0 TFLOP/s

Bottleneck: idle — press Next to begin

What's happening?

Press Next to walk through one DLRM forward pass. Try the three mode buttons above: each rebuilds the model architecture so you can compare embedding-heavy and MLP-heavy DLRMs side by side.

DLRM Forward Pass

\(e_i = E_i[\text{idx}_i]\) (lookup)
\(d = \text{MLP}_\text{bot}(x_\text{dense})\)
\(z_{ij} = e_i \cdot e_j\) (dot)
\(\hat{y} = \sigma(\text{MLP}_\text{top}([d; z]))\)
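
The four equations above can be sketched in plain Python. This is a minimal illustration, not the reference implementation: the table sizes, layer widths, random weights, ReLU activations, and the helper names `make_mlp`, `mlp`, and `dlrm_forward` are all assumptions chosen for readability. It follows the equations as written here, so only the embedding vectors are interacted with each other.

```python
import math
import random

def make_mlp(sizes, rng):
    """Build random (weights, bias) layers for the given widths (illustrative only)."""
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((W, [0.0] * n_out))
    return layers

def mlp(x, layers):
    """Apply each layer; ReLU after every layer except the last (an assumption)."""
    for k, (W, b) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]
        if k < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x

def dlrm_forward(idx, x_dense, tables, bot, top):
    e = [E[i] for E, i in zip(tables, idx)]            # e_i = E_i[idx_i]  (lookup)
    d = mlp(x_dense, bot)                               # d = MLP_bot(x_dense)
    z = [sum(a * b for a, b in zip(e[i], e[j]))         # z_ij = e_i . e_j for i < j
         for i in range(len(e)) for j in range(i + 1, len(e))]
    logit = mlp(d + z, top)[0]                          # MLP_top([d; z])
    return 1.0 / (1.0 + math.exp(-logit))               # sigma(.)

# Hypothetical toy sizes: 6 tables as in the panel above, everything else made up.
rng = random.Random(0)
DIM, ROWS, N_TABLES, N_DENSE = 4, 10, 6, 3
tables = [[[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(ROWS)]
          for _ in range(N_TABLES)]
n_pairs = N_TABLES * (N_TABLES - 1) // 2
bot = make_mlp([N_DENSE, 8, 8, 8, DIM], rng)            # 4 layers
top = make_mlp([DIM + n_pairs, 8, 8, 8, 1], rng)        # 4 layers

y_hat = dlrm_forward([1, 2, 3, 4, 5, 6], [0.5, -1.0, 2.0], tables, bot, top)
print(f"click probability: {y_hat:.4f}")
```

The lookup line does no arithmetic at all, it only reads rows, while nearly all multiply-adds sit inside `mlp`. That split is what the hardware panel is meant to show.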

Roofline Position

Embedding lookups sit far to the left of the ridge point, in the memory-bound region.
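
A roofline position can be computed directly from the two peak figures in the hardware panel (312 TFLOP/s and 2 TB/s). The sketch below assumes the simplest roofline model, where attainable throughput is the smaller of the compute peak and bandwidth times arithmetic intensity. The function name `attainable_flops` and the sample intensities are illustrative, not taken from the visualizer.

```python
PEAK_FLOPS = 312e12   # FLOP/s, peak from the hardware panel
PEAK_BW = 2e12        # bytes/s, HBM peak from the hardware panel

def attainable_flops(intensity):
    """Simple roofline: capped by compute, or by bandwidth * intensity (FLOP/byte)."""
    return min(PEAK_FLOPS, PEAK_BW * intensity)

# Ridge point: the intensity at which the two limits meet.
ridge = PEAK_FLOPS / PEAK_BW

# Sample intensities are hypothetical; anything left of the ridge is memory-bound.
for ai in (0.25, 10.0, ridge, 1000.0):
    regime = "memory-bound" if ai < ridge else "compute-bound"
    print(f"{ai:8.2f} FLOP/B -> {attainable_flops(ai) / 1e12:7.2f} TFLOP/s ({regime})")
```

With these peaks the ridge sits at 156 FLOP/B, so a kernel needs at least that many FLOPs per byte moved before the Tensor Cores, rather than HBM, become the limit.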

Step Stats

Bytes accessed

0 KB

FLOPs

Arith. intensity

— FLOP/B

Access pattern

—

Why this matters

Embedding tables in production DLRMs can reach terabyte scale. Each lookup touches only a tiny slice, but the access pattern is irregular and cache-unfriendly, so HBM bandwidth, not FLOPs, bounds throughput. MLPs are the opposite: small, dense, and Tensor-Core-friendly. This is why DLRM motivated specialized hardware like TPU v4/v7 and Meta's MTIA.
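
The contrast can be made concrete with a back-of-the-envelope arithmetic-intensity estimate. The sketch below is a rough model under stated assumptions: fp32 values, a sum-pooled lookup of 32 rows of width 128, a 1024x1024 dense layer at batch size 2048, and perfect reuse of weights and activations. All of these sizes and both function names are hypothetical, picked only to illustrate the gap.

```python
BYTES_PER_VALUE = 4  # fp32 (assumption)

def pooled_lookup_intensity(rows, dim):
    """Sum-pool `rows` embedding rows of width `dim`: adds only, every row read once."""
    flops = (rows - 1) * dim                        # elementwise additions
    bytes_moved = (rows * dim + dim) * BYTES_PER_VALUE  # rows read + pooled vector written
    return flops / bytes_moved

def dense_layer_intensity(batch, n_in, n_out):
    """One matrix multiply: 2 FLOPs per multiply-add, each tensor moved once."""
    flops = 2 * batch * n_in * n_out
    bytes_moved = (n_in * n_out + batch * n_in + batch * n_out) * BYTES_PER_VALUE
    return flops / bytes_moved

ai_emb = pooled_lookup_intensity(rows=32, dim=128)
ai_mlp = dense_layer_intensity(batch=2048, n_in=1024, n_out=1024)

print(f"pooled embedding lookup: {ai_emb:.2f} FLOP/B")
print(f"dense MLP layer:         {ai_mlp:.1f} FLOP/B")
```

Under these assumptions the lookup lands well below 1 FLOP/B while the dense layer is above 200 FLOP/B, a gap of roughly three orders of magnitude. Real kernels will differ, since cache behavior, precision, and batch size all move these numbers.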

Reference: Naumov et al., Deep Learning Recommendation Model for Personalization and Recommendation Systems, Meta AI, arXiv:1906.00091. Course: EE508 Week 9.