"Sparse embeddings starve the bandwidth, dense MLPs starve the compute."
Press Next to walk through one DLRM forward pass. Try the three mode buttons above — each rebuilds the model architecture so you can see embedding-heavy and MLP-heavy DLRMs visually.
Embedding lookups sit at the far left of the roofline (low arithmetic intensity), so they are memory-bound.
Embedding tables in production DLRMs can reach terabyte scale. Each lookup touches only a few rows, but the access pattern is irregular and caches poorly, so HBM bandwidth, not FLOPs, bounds throughput. MLPs are the opposite: small, dense, and Tensor-Core-friendly. This is why DLRM workloads motivated specialized hardware like TPU v4/v7 and Meta's MTIA.
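To make the bandwidth-versus-FLOPs contrast concrete, here is a minimal sketch that estimates arithmetic intensity (FLOPs per byte moved) for a sum-pooled embedding lookup and for one dense MLP layer, then compares each to a roofline ridge point. The function names, the fp32 element size, and the accelerator figures (100 TFLOP/s peak, 2 TB/s HBM) are illustrative assumptions, not measurements of any real chip.

```python
# Rough roofline arithmetic. All hardware numbers below are hypothetical.

PEAK_FLOPS = 100e12   # assumed peak compute, FLOP/s
HBM_BW = 2e12         # assumed HBM bandwidth, bytes/s
RIDGE = PEAK_FLOPS / HBM_BW  # FLOPs/byte at which the two limits meet


def embedding_intensity(dim: int, pooling: int, bytes_per_elem: int = 4) -> float:
    """FLOPs per byte for one sum-pooled lookup of `pooling` rows of width `dim`."""
    bytes_read = pooling * dim * bytes_per_elem  # every gathered row is read once
    flops = (pooling - 1) * dim                  # additions needed to sum the rows
    return flops / bytes_read


def dense_layer_intensity(batch: int, d_in: int, d_out: int,
                          bytes_per_elem: int = 4) -> float:
    """FLOPs per byte for a (batch x d_in) @ (d_in x d_out) matmul."""
    flops = 2 * batch * d_in * d_out  # one multiply and one add per term
    # Weights, input activations, and output activations each move once.
    bytes_moved = bytes_per_elem * (d_in * d_out + batch * d_in + batch * d_out)
    return flops / bytes_moved


def bound(intensity: float) -> str:
    """Classify an operation against the assumed ridge point."""
    return "memory-bound" if intensity < RIDGE else "compute-bound"


emb = embedding_intensity(dim=128, pooling=32)
mlp = dense_layer_intensity(batch=2048, d_in=1024, d_out=1024)
print(f"embedding: {emb:.2f} FLOPs/byte ({bound(emb)})")
print(f"dense MLP: {mlp:.1f} FLOPs/byte ({bound(mlp)})")
```

With these assumed sizes the embedding lookup lands at roughly 0.24 FLOPs/byte and the dense layer at roughly 205 FLOPs/byte, on opposite sides of the assumed ridge of 50 FLOPs/byte. The exact values shift with batch size, precision, and caching, but the gap of several orders of magnitude is the point.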
Same DLRM forward pass, three different model shapes. The architectures are drawn to scale — bigger boxes mean more memory or compute. Bars at the bottom of each panel show hardware pressure at the embedding lookup step (Step 2) and the top MLP step (Step 5).
Many large tables, shallow MLPs. Step 2 saturates HBM; tensor cores barely move. This is what production ad-ranking DLRMs look like — and why they need HBM-rich, embedding-aware accelerators.
Moderate tables and MLPs. Both bottlenecks appear during the forward pass: the embedding step is memory-bound and the MLP step is compute-bound. The roofline dot bounces between the two regions as you step through.
Few small tables, deep MLPs. HBM is comfortable; tensor cores run hot. Looks more like a normal deep network — and benefits from generic GPU acceleration.
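The three shapes can be compared with back-of-the-envelope arithmetic. The sketch below estimates a lower bound on time for the embedding lookup step (bytes gathered divided by HBM bandwidth) and for the top MLP step (FLOPs divided by peak compute). Every number in it is a hypothetical assumption chosen for illustration: the table counts, widths, pooling factors, the "balanced" label for the middle mode, and the accelerator figures. It also ignores MLP weight traffic and assumes perfect utilization.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Shape:
    name: str
    num_tables: int        # embedding tables looked up per sample
    dim: int               # embedding width
    pooling: int           # rows gathered per table per sample
    mlp_widths: List[int]  # top-MLP layer widths, input first


def step_times(s: Shape, batch: int, peak_flops: float = 100e12,
               hbm_bw: float = 2e12, bytes_per_elem: int = 4) -> Tuple[float, float]:
    """Return (embedding_seconds, mlp_seconds) as idealized lower bounds."""
    emb_bytes = batch * s.num_tables * s.pooling * s.dim * bytes_per_elem
    mlp_flops = sum(2 * batch * a * b
                    for a, b in zip(s.mlp_widths, s.mlp_widths[1:]))
    return emb_bytes / hbm_bw, mlp_flops / peak_flops


# Hypothetical model shapes, not taken from any published DLRM configuration.
SHAPES = [
    Shape("embedding-heavy", 200, 128, 40, [512, 256, 1]),
    Shape("balanced", 26, 64, 10, [1024, 1024, 1024, 512, 1]),
    Shape("MLP-heavy", 8, 32, 1, [1024] * 8 + [1]),
]

for s in SHAPES:
    t_emb, t_mlp = step_times(s, batch=4096)
    dominant = "embedding lookup" if t_emb > t_mlp else "top MLP"
    print(f"{s.name:16s} emb={t_emb * 1e6:9.1f} us  "
          f"mlp={t_mlp * 1e6:9.1f} us  dominant: {dominant}")
```

Under these assumptions the embedding-heavy shape spends far longer waiting on HBM than on matmuls, the MLP-heavy shape is the reverse, and the balanced shape keeps both steps within the same order of magnitude.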