See exactly which bits represent sign, exponent, and mantissa — and why the choice of format matters for hardware efficiency.
Slide or type to watch all four formats update in real time.
| Format | Total bits | Sign | Exponent | Mantissa | Max range | Precision (decimal digits) | Typical use |
|---|---|---|---|---|---|---|---|
BF16 keeps the same 8-bit exponent as FP32, preserving dynamic range, but cuts the mantissa from 23 bits to 7, so it trades precision for range.
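FP16 has only 5 exponent bits, so its range is narrow (largest finite value 65504, smallest normal value about 6.1e-5). During training, small gradients underflow to zero and large values overflow to infinity. BF16 has 8 exponent bits (same as FP32), so gradient magnitudes stay representable without loss-scaling hacks.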
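The range-versus-precision trade-off can be checked with a short sketch using only the Python standard library. The helper names `to_fp16` and `to_bf16` are made up for illustration, and `to_bf16` truncates the low 16 bits of the FP32 encoding, whereas real hardware typically rounds to nearest, so treat it as an approximation.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to IEEE 754 half precision; values beyond the FP16 range become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values that overflow FP16; model that as infinity.
        return float("inf") if x > 0 else float("-inf")

def to_bf16(x: float) -> float:
    """Approximate bfloat16 by keeping only the top 16 bits of the FP32 encoding.

    Assumption: plain truncation, not the round-to-nearest used by most hardware.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

# A large value, a tiny gradient-like value, and a value close to 1.
for value in (70000.0, 1e-8, 1.001):
    print(f"{value!r:>10}  fp16={to_fp16(value)!r:<22} bf16={to_bf16(value)!r}")
```

With these inputs, 70000.0 overflows to `inf` in FP16 but survives in BF16 as 69632.0, and 1e-8 flushes to 0.0 in FP16 while staying non-zero in BF16. The value 1.001 shows the other side of the trade: FP16 keeps it as 1.0009765625, while truncated BF16 collapses it to 1.0.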
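Integer ops need none of the exponent handling (alignment, normalization, rounding) that floating-point hardware performs. An A100 delivers twice as many INT8 TOPS as FP16 TFLOPS (624 vs 312 dense Tensor Core peak). INT8 also halves memory traffic vs FP16, and memory bandwidth is a key bottleneck for LLM decoding.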
Exponents are stored unsigned with a bias (127 for FP32 and BF16, 15 for FP16). Actual exponent = stored value − bias. Because of this, non-negative floats sort in the same order as their bit patterns read as unsigned integers, while the format still represents both tiny and huge numbers.
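As a minimal sketch of the bias rule, the snippet below splits an FP32 value into its three fields and recovers the actual exponent. The names `decode_fp32` and `FP32_BIAS` are illustrative only, and the reconstruction formula covers normal numbers (not subnormals, infinities, or NaN).

```python
import struct

FP32_BIAS = 127  # bias for the 8-bit FP32 exponent field

def decode_fp32(x: float) -> tuple[int, int, int]:
    """Return (sign, stored_exponent, mantissa) of x encoded as FP32."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                 # 1 bit
    stored_exp = (bits >> 23) & 0xFF  # 8 bits, unsigned
    mantissa = bits & 0x7FFFFF        # 23 bits, the fraction after the implicit leading 1
    return sign, stored_exp, mantissa

sign, stored_exp, mantissa = decode_fp32(6.5)
actual_exp = stored_exp - FP32_BIAS   # 129 - 127 = 2

# Rebuild the value from its fields (valid for normal numbers only).
value = (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** actual_exp
print(sign, stored_exp, actual_exp, hex(mantissa), value)
```

Here 6.5 = 1.625 × 2², so the stored exponent is 129 and the actual exponent is 2.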
A 7B parameter model: FP32 ≈ 28 GB, FP16/BF16 ≈ 14 GB, INT8 ≈ 7 GB, INT4 ≈ 3.5 GB. Halving bit-width halves the bandwidth needed to stream weights — directly raising arithmetic intensity on the roofline.
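The footprint figures above follow from parameters × bits ÷ 8. The sketch below reproduces them and adds a rough arithmetic-intensity estimate under stated assumptions: batch size 1, 2 FLOPs (one multiply, one add) per weight per token, and only weight traffic counted (no KV cache or activations). The function names are hypothetical.

```python
def weight_gb(n_params: float, bits: int) -> float:
    """Weight storage in GB (10^9 bytes) for n_params parameters at the given bit-width."""
    return n_params * bits / 8 / 1e9

def decode_intensity(bits: int) -> float:
    """FLOPs per byte of weights streamed, assuming batch size 1 and 2 FLOPs per weight."""
    return 2 / (bits / 8)

N_PARAMS = 7e9  # the 7B model from the text

for name, bits in (("FP32", 32), ("FP16/BF16", 16), ("INT8", 8), ("INT4", 4)):
    print(f"{name:<10} {weight_gb(N_PARAMS, bits):5.1f} GB   "
          f"{decode_intensity(bits):.1f} FLOP/byte")
```

Under these assumptions, each halving of bit-width doubles the FLOPs performed per byte streamed, which is the shift to the right on the roofline that the text describes.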