
Floating-Point Formats for ML

See exactly which bits represent sign, exponent, and mantissa — and why the choice of format matters for hardware efficiency.


Format Properties at a Glance

Format | Total bits | Sign | Exponent | Mantissa | Max range | Precision (decimal digits) | Typical use

Relative Range vs. Precision

BF16 keeps the same 8-bit exponent as FP32, preserving dynamic range, but cuts the mantissa from 23 bits to 7 and so gives up precision.
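The trade-off can be seen with a small sketch that emulates BF16 by keeping only the top 16 bits of an FP32 encoding. The helper names (to_bf16_bits, from_bf16_bits, bf16_round_trip) are made up for this illustration, and plain truncation is an assumption chosen for simplicity; real hardware typically rounds to nearest even instead.

```python
import struct

def to_bf16_bits(x: float) -> int:
    """Return a 16-bit BF16 pattern for x by truncating its FP32 encoding."""
    (fp32_bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Keep the sign bit, all 8 exponent bits, and the top 7 mantissa bits.
    # Assumption: truncation, not the round-to-nearest-even used by most hardware.
    return fp32_bits >> 16

def from_bf16_bits(bits: int) -> float:
    """Expand a 16-bit BF16 pattern back into a Python float."""
    (value,) = struct.unpack("<f", struct.pack("<I", bits << 16))
    return value

def bf16_round_trip(x: float) -> float:
    """Convert x to (truncated) BF16 and back."""
    return from_bf16_bits(to_bf16_bits(x))

# Range: a value near the top of the FP32 range stays finite in BF16.
print(bf16_round_trip(3.0e38))

# Precision: with 7 mantissa bits, the spacing just above 1.0 is 2**-7,
# so a 2**-8 increment is lost while a 2**-7 increment survives.
print(bf16_round_trip(1.0 + 2**-8))
print(bf16_round_trip(1.0 + 2**-7))
```

Because BF16 is the upper half of an FP32 word, conversion in either direction needs no exponent rescaling, which is part of why the format is cheap to support in hardware.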

🎯

Why BF16 beats FP16 for training

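FP16 has only 5 exponent bits, so its range is narrow: the largest finite value is 65504 and the smallest normal value is about 6.1e-5. Small gradients underflow to zero and large values overflow to infinity, which is why FP16 training relies on loss scaling. BF16 has 8 exponent bits (same as FP32), so gradient magnitudes stay representable without loss-scaling hacks.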
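Both ends of FP16's range can be probed with Python's standard struct module, whose "e" format is IEEE 754 half precision. This is a minimal sketch: fp16_round_trip is a hypothetical helper, and the scale factor of 1024 is an arbitrary illustrative choice, not a recommended loss-scaling setting.

```python
import struct

def fp16_round_trip(x: float) -> float:
    """Round x to IEEE 754 half precision and back."""
    try:
        (value,) = struct.unpack("<e", struct.pack("<e", x))
        return value
    except OverflowError:
        # struct refuses to pack values beyond the FP16 range;
        # we return infinity here to mimic what FP16 arithmetic produces.
        return float("inf") if x > 0 else float("-inf")

print(fp16_round_trip(65504.0))  # largest finite FP16 value is preserved
print(fp16_round_trip(70000.0))  # beyond the range: overflows to infinity
print(fp16_round_trip(1e-8))     # too small for FP16: flushes to zero

# Loss scaling in miniature: multiply before the cast so the value
# lands inside FP16's range, then divide the scale back out afterwards.
scale = 1024.0  # illustrative assumption
scaled = fp16_round_trip(1e-8 * scale)
print(scaled / scale)            # nonzero, close to 1e-8
```

The recovered value is only approximate because the scaled number falls in FP16's subnormal range, where few significant bits remain.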

Why INT8 is fastest for inference

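Integer multiply-accumulate needs no exponent alignment, normalization, or rounding, so the arithmetic units are simpler and more of them fit in the same silicon. An A100 delivers 2× the dense Tensor Core throughput in INT8 (624 TOPS) that it does in FP16 (312 TFLOPS). INT8 also halves the bytes moved per weight vs FP16, and memory bandwidth is a key bottleneck for LLM decoding.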

🔢

The exponent bias

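Exponents are stored unsigned with a bias (127 for FP32 and BF16, 15 for FP16). For normal numbers, actual exponent = stored value − bias. Because the exponent field sits above the mantissa and grows with magnitude, non-negative floats sort in the same order as their bit patterns read as unsigned integers, while the format still covers both tiny and huge numbers.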
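As a sketch of how the bias works in practice, the snippet below splits an FP32 value into its three fields and subtracts the bias of 127. The function names fp32_bits and decode_fp32 are invented for this example; the subtraction is only meaningful for normal numbers (stored exponent 1 to 254), since 0 and 255 are reserved for zero/subnormals and infinity/NaN.

```python
import struct

FP32_BIAS = 127

def fp32_bits(x: float) -> int:
    """Return the 32-bit pattern of x encoded as FP32."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits

def decode_fp32(x: float) -> dict:
    """Split an FP32 value into sign, exponent, and mantissa fields."""
    bits = fp32_bits(x)
    stored_exponent = (bits >> 23) & 0xFF   # 8 exponent bits
    return {
        "sign": bits >> 31,                 # 1 sign bit
        "stored_exponent": stored_exponent,
        # Valid for normal numbers only (stored exponent 1..254).
        "actual_exponent": stored_exponent - FP32_BIAS,
        "mantissa": bits & 0x7FFFFF,        # 23 mantissa bits
    }

print(decode_fp32(1.0))    # stored 127, actual 0
print(decode_fp32(0.5))    # stored 126, actual -1
print(decode_fp32(-6.0))   # -1.5 * 2**2: sign 1, stored 129, actual 2

# Non-negative floats sort in the same order as their bit patterns.
values = [0.5, 1.0, 3.0, 1e10]
print(sorted(values) == sorted(values, key=fp32_bits))
```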

📉

Memory bandwidth savings

Weights of a 7B-parameter model: FP32 ≈ 28 GB, FP16/BF16 ≈ 14 GB, INT8 ≈ 7 GB, INT4 ≈ 3.5 GB. Halving the bit-width halves the bytes streamed per token, so the same FLOPs are spread over half the memory traffic, directly raising arithmetic intensity on the roofline.
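The figures above follow from parameters × bits ÷ 8. The sketch below reproduces them and adds a rough lower bound on decode time per token, assuming every weight is read once per token. The function names and the 2000 GB/s bandwidth are illustrative assumptions, and the estimate ignores the KV cache, activations, and any quantization scale factors.

```python
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "BF16": 16, "INT8": 8, "INT4": 4}

def weight_gigabytes(num_params: float, fmt: str) -> float:
    """Weight storage in GB (1 GB = 1e9 bytes) for the given format."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9

def min_decode_ms_per_token(num_params: float, fmt: str,
                            bandwidth_gb_per_s: float) -> float:
    """Lower bound on per-token latency if all weights stream once per token."""
    return weight_gigabytes(num_params, fmt) / bandwidth_gb_per_s * 1e3

NUM_PARAMS = 7e9           # 7B-parameter model
BANDWIDTH_GB_PER_S = 2000  # hypothetical accelerator memory bandwidth

for fmt in BITS_PER_WEIGHT:
    size = weight_gigabytes(NUM_PARAMS, fmt)
    latency = min_decode_ms_per_token(NUM_PARAMS, fmt, BANDWIDTH_GB_PER_S)
    print(f"{fmt}: {size:.1f} GB, at least {latency:.2f} ms per token")
```

Under these assumptions the bound scales linearly with bit-width, so each halving of precision halves the minimum time spent streaming weights.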