EE508 Quantization Lab

INT8 Quantization & Calibration Visualizer

Move FP32 activation values into low-precision integer bins, then watch dequantization and error appear as calibration range changes.

FP32 inference (32 bits): Uses 32-bit floating-point values.
INT8 inference (8 bits): Maps continuous values to discrete integer bins.
Memory density (4x): The same memory budget can hold four times as many INT8 values as FP32 values.

From FP32 to bins

Quantization replaces continuous FP32 activations with a finite set of integer codes.

Why it helps

Lower precision can reduce memory traffic and improve accelerator throughput when hardware supports it.

Calibration matters

The calibration range decides which real values fit inside the integer range.

Too narrow

A narrow range gives a smaller step size but clips values outside the calibrated interval.

Too wide

A wide range avoids clipping but spreads bins out, increasing rounding error for common values.

FP32 value -> integer q -> dequantized x_hat -> error
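
The pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not the visualizer's own code: the calibration range, the sample values, the function names, and the sign convention error = x_hat - x are all assumptions chosen for the example.

```python
def quantize(x, scale, zero_point, q_min, q_max):
    """Map a real value to an integer code, clamping into [q_min, q_max]."""
    q = round(x / scale) + zero_point  # Python's round() rounds halves to even
    return max(q_min, min(q_max, q))


def dequantize(q, scale, zero_point):
    """Map an integer code back to an approximate real value x_hat."""
    return scale * (q - zero_point)


# Assumed calibration range and UINT8 code range (illustrative only).
r_min, r_max = -1.0, 3.0
q_min, q_max = 0, 255
scale = (r_max - r_min) / (q_max - q_min)
zero_point = round(q_min - r_min / scale)

values = [-1.5, -0.2, 0.0, 0.7, 2.9, 4.0]  # hypothetical FP32 activations
rows = []
for x in values:
    q = quantize(x, scale, zero_point, q_min, q_max)
    x_hat = dequantize(q, scale, zero_point)
    rows.append((x, q, x_hat, x_hat - x))
    print(f"{x:6.2f} -> q={q:3d} -> x_hat={x_hat:8.4f} -> error={x_hat - x:8.4f}")
```

With this range, -1.5 and 4.0 fall outside the calibrated interval and are clamped to codes 0 and 255, so their error is much larger than the half-step rounding error of the in-range values.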

Visualizer readout (populated after running Animate Quantization): Scale, Zero point, MSE, MAE, SQNR (dB), Clipped (count), Clipped %.
Legend: pending, quantized, clipped.
Results table columns: Value, q, x_hat, Error.
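
The error metrics named in the readout (MSE, MAE, SQNR, clipped count and percentage) could be computed roughly as follows. This is a sketch under assumptions: the function name and inputs are hypothetical, SQNR is taken as 10*log10(signal power / error power), and a value counts as clipped when it lies outside the calibration range. The visualizer may define these differently.

```python
import math


def error_metrics(values, recon, r_min, r_max):
    """Summarize quantization error between originals and dequantized values."""
    n = len(values)
    errors = [x_hat - x for x, x_hat in zip(values, recon)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    signal_power = sum(x * x for x in values) / n
    if mse == 0:
        sqnr_db = math.inf        # no error at all
    elif signal_power == 0:
        sqnr_db = -math.inf       # all-zero signal with nonzero error
    else:
        sqnr_db = 10 * math.log10(signal_power / mse)
    clipped = sum(1 for x in values if x < r_min or x > r_max)
    return {
        "mse": mse,
        "mae": mae,
        "sqnr_db": sqnr_db,
        "clipped": clipped,
        "clipped_pct": 100.0 * clipped / n,
    }


# Hypothetical data: the last value lies outside the range [0, 3] and is clipped to 3.
metrics = error_metrics([0.0, 1.0, 2.0, 5.0], [0.0, 1.0, 2.0, 3.0], 0.0, 3.0)
print(metrics)
```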

Error and Calibration

  • Narrow calibration ranges reduce the quantization step, so in-range values can be represented more precisely.
  • Values outside the range are clipped to the nearest endpoint, which can dominate error when outliers matter.
  • Wide ranges avoid clipping but increase the step size, so ordinary values suffer more rounding error.
  • Good calibration balances clipping error and rounding error for the activation distribution seen during inference.
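
The trade-off in the list above can be explored with a small sweep over calibration limits. The sketch below assumes symmetric INT8 quantization over [-limit, limit] and uses made-up data: 201 evenly spaced values in [-1, 1] plus one outlier at 8.0. The function name and the returned fields are hypothetical.

```python
def sweep_point(values, limit, q_max=127):
    """Quantize and dequantize over [-limit, limit]; report step size and errors."""
    scale = limit / q_max
    clipped = 0
    in_range = 0
    in_range_sq = 0.0
    total_sq = 0.0
    for x in values:
        q = max(-q_max, min(q_max, round(x / scale)))
        err_sq = (q * scale - x) ** 2
        total_sq += err_sq
        if abs(x) > limit:
            clipped += 1              # value was clamped to an endpoint
        else:
            in_range += 1
            in_range_sq += err_sq     # pure rounding error
    return {
        "step": scale,
        "clipped": clipped,
        "in_range_mse": in_range_sq / in_range if in_range else 0.0,
        "total_mse": total_sq / len(values),
    }


# Hypothetical activations: many ordinary values plus a single outlier.
values = [i / 100 for i in range(-100, 101)] + [8.0]
results = {limit: sweep_point(values, limit) for limit in (0.5, 1.0, 8.0)}
for limit, r in results.items():
    print(f"limit={limit:4.1f} step={r['step']:.5f} clipped={r['clipped']:3d} "
          f"in_range_mse={r['in_range_mse']:.2e} total_mse={r['total_mse']:.2e}")
```

In this toy data the clipped outlier dominates total MSE until the range covers it, while the rounding error on in-range values is larger at limit 8.0 than at limit 1.0 because the step size grows with the range.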

Symmetric vs. Asymmetric

Symmetric quantization uses a signed range such as INT8 [-127, 127], forces zero_point = 0, and expands the calibration interval to [-a, a], where a = max(abs(min), abs(max)).

Asymmetric quantization uses an unsigned range such as UINT8 [0, 255], computes zero_point from the selected min/max range, and clamps q into [q_min, q_max].
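
One common way to derive the parameters for the two schemes is sketched below. The function names are hypothetical, the formulas follow the usual affine convention x_hat = scale * (q - zero_point), and the code assumes r_max > r_min. The visualizer's exact rounding rules are not specified here.

```python
def symmetric_params(r_min, r_max, q_max=127):
    """Signed range [-q_max, q_max], zero_point fixed at 0."""
    limit = max(abs(r_min), abs(r_max))  # interval expands to [-limit, limit]
    scale = limit / q_max
    return scale, 0, limit


def asymmetric_params(r_min, r_max, q_min=0, q_max=255):
    """Unsigned range [q_min, q_max], zero_point derived from r_min."""
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = round(q_min - r_min / scale)
    zero_point = max(q_min, min(q_max, zero_point))  # keep it a valid code
    return scale, zero_point


# Hypothetical skewed calibration range.
r_min, r_max = -0.5, 3.5
sym_scale, sym_zp, sym_limit = symmetric_params(r_min, r_max)
asym_scale, asym_zp = asymmetric_params(r_min, r_max)
print(f"symmetric : scale={sym_scale:.5f} zero_point={sym_zp} interval=[-{sym_limit}, {sym_limit}]")
print(f"asymmetric: scale={asym_scale:.5f} zero_point={asym_zp}")
```

For a skewed range like this one, the symmetric scheme spends codes on [-3.5, -0.5), which the data never uses, so its step size comes out larger than the asymmetric one.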

Hardware Intuition: Memory Traffic

FP32 uses 4 bytes per value. INT8 uses 1 byte per value, so the same memory budget can carry four times as many activation values.

FP32 block: 8 values x 4 bytes = 32 bytes

Fewer values fit in a fixed memory block, so each value consumes more bandwidth.

INT8 block: 32 values x 1 byte = 32 bytes

More values fit in the same memory footprint. Real speedup still depends on hardware support, memory bandwidth, and kernel implementation.
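
The block arithmetic above can be checked with a short calculation. The 32-byte block size matches the illustration; the names are hypothetical, and the result describes storage density only, not measured speedup.

```python
BYTES_PER_VALUE = {"fp32": 4, "int8": 1}


def values_per_block(block_bytes, dtype):
    """Number of whole values of the given type that fit in a memory block."""
    return block_bytes // BYTES_PER_VALUE[dtype]


block_bytes = 32  # same block size as the illustration above
fp32_count = values_per_block(block_bytes, "fp32")
int8_count = values_per_block(block_bytes, "int8")
print(f"FP32: {fp32_count} values, INT8: {int8_count} values, "
      f"ratio: {int8_count // fp32_count}x")
```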