EE508 Quantization Lab

INT8 Quantization & Calibration Visualizer

Move FP32 activation values into low-precision integer bins, then watch dequantization and error change as the calibration range changes.

FP32 inference: uses 32-bit floating-point values.

INT8 inference: maps continuous values to discrete integer bins.

Memory density: the same memory budget can hold four times as many INT8 values.

From FP32 to bins

Quantization replaces continuous FP32 activations with a finite set of integer codes.

Why it helps

Lower precision can reduce memory traffic and improve accelerator throughput when hardware supports it.

Calibration matters

The calibration range decides which real values fit inside the integer range.

Too narrow

A narrow range gives a smaller step size but clips values outside the calibrated interval.

Too wide

A wide range avoids clipping but spreads bins out, increasing rounding error for common values.
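The trade-off between the two cases comes down to the step size, which a few lines of arithmetic can make concrete. This is a minimal sketch: the helper name `step_size` is hypothetical, and it assumes all 2**bits integer codes are used, so that q_max - q_min = 2**bits - 1.

```python
def step_size(x_min: float, x_max: float, bits: int) -> float:
    """Width of one bin when [x_min, x_max] is spread over 2**bits codes.

    Assumption: every code is used, so q_max - q_min = 2**bits - 1.
    """
    return (x_max - x_min) / (2 ** bits - 1)


# Widening the range at a fixed bit width makes every bin wider.
for x_min, x_max in [(-1.0, 1.0), (-2.0, 2.0), (-4.0, 4.0)]:
    print(f"8-bit, range [{x_min}, {x_max}]: step = {step_size(x_min, x_max, 8):.5f}")

# Lowering the bit width at a fixed range has the same effect.
for bits in (2, 4, 8):
    print(f"{bits}-bit, range [-2.0, 2.0]: step = {step_size(-2.0, 2.0, bits):.5f}")
```

Doubling the range doubles the step, while any value outside the chosen range is clipped regardless of how small the step is.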

Controls

Change precision, range, and data shape. Metrics update immediately.

Bit width

2-bit 4-bit 8-bit

Quantization mode

Symmetric Asymmetric

Activation distribution

Calibration min: -2.0

Calibration max: 2.0

Animation speed: Slow (1.2x)

Formula Panel

The same equations power every dot in the animation.

scale = (x_max - x_min) / (q_max - q_min)
q = round(x / scale + zero_point)
x_hat = scale * (q - zero_point)
error = x - x_hat
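The equations above can be sketched as a quantize/dequantize round trip. The function names and the example `scale` and `zero_point` values below are illustrative assumptions, not taken from the visualizer's source. Note that Python's built-in `round` rounds exact halves to the nearest even integer, which may differ from the rounding the animation uses.

```python
def quantize(x: float, scale: float, zero_point: int, q_min: int, q_max: int) -> int:
    """q = round(x / scale + zero_point), clamped into [q_min, q_max]."""
    q = round(x / scale + zero_point)
    return max(q_min, min(q_max, q))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """x_hat = scale * (q - zero_point)."""
    return scale * (q - zero_point)


# Hypothetical parameters: roughly what a [-2, 2] range gives for UINT8.
scale, zero_point, q_min, q_max = 4.0 / 255, 128, 0, 255

for x in (-3.0, -0.7, 0.0, 0.5, 2.5):
    q = quantize(x, scale, zero_point, q_min, q_max)
    x_hat = dequantize(q, scale, zero_point)
    print(f"x={x:+.3f}  q={q:3d}  x_hat={x_hat:+.5f}  error={x - x_hat:+.5f}")
```

In-range values come back with an error of at most half a step, while the values beyond the range land on q_min or q_max and carry a much larger error.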

Legend: FP32 value, integer q, x_hat/error

Metrics: Scale, Zero point, MSE, MAE, SQNR (dB), Clipped (count), Clipped (%)
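The error metrics above can be computed from the original values and their reconstructions. This is a sketch under stated assumptions: the page does not define SQNR, so the common definition 10 * log10(signal power / noise power) is assumed, and a value counts as clipped when it lies outside [x_min, x_max]. The function name `error_metrics` is hypothetical.

```python
import math


def error_metrics(values, recon, x_min, x_max):
    """Summarize quantization error for values and their reconstructions x_hat."""
    n = len(values)
    errors = [x - x_hat for x, x_hat in zip(values, recon)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    signal_power = sum(x * x for x in values) / n

    # Assumed SQNR definition: signal power over noise power, in decibels.
    if mse == 0:
        sqnr_db = float("inf")
    elif signal_power == 0:
        sqnr_db = float("-inf")
    else:
        sqnr_db = 10 * math.log10(signal_power / mse)

    # Assumed clipping rule: the value falls outside the calibration range.
    clipped = sum(1 for x in values if x < x_min or x > x_max)
    return {
        "mse": mse,
        "mae": mae,
        "sqnr_db": sqnr_db,
        "clipped": clipped,
        "clipped_pct": 100.0 * clipped / n,
    }


# Illustrative data: the last value lies outside [-2, 2] and was clipped to 2.0.
print(error_metrics([1.0, -1.0, 3.0], [0.9, -1.1, 2.0], x_min=-2.0, x_max=2.0))
```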

Value	q	x_hat	Error

Error and Calibration

Narrow calibration ranges reduce the quantization step, so in-range values can be represented more precisely.
Values outside the range are clipped to the nearest endpoint, which can dominate error when outliers matter.
Wide ranges avoid clipping but increase the step size, so ordinary values suffer more rounding error.
Good calibration balances clipping error and rounding error for the activation distribution seen during inference.
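The balance between clipping error and rounding error can be seen by sweeping the calibration bound over a fixed data set. The sketch below is illustrative only: it assumes symmetric 4-bit quantization with codes in [-7, 7], and the data (evenly spaced values in [-1, 1] plus a single outlier at 6.0) is made up for the demonstration. The helper name `symmetric_mse` is hypothetical.

```python
def symmetric_mse(values, bound, bits):
    """MSE after symmetric quantization to [-bound, bound] with zero_point = 0."""
    q_max = 2 ** (bits - 1) - 1          # e.g. 7 for 4 bits
    scale = bound / q_max                # (x_max - x_min) / (q_max - q_min)
    total = 0.0
    for x in values:
        q = max(-q_max, min(q_max, round(x / scale)))  # round, then clamp
        total += (x - scale * q) ** 2
    return total / len(values)


# Made-up activations: a dense cluster in [-1, 1] and one outlier.
values = [i / 1000 for i in range(-1000, 1001)] + [6.0]

for bound in (0.25, 0.5, 1.0, 2.0, 4.0, 6.0):
    print(f"bound={bound:4.2f}  MSE={symmetric_mse(values, bound, bits=4):.5f}")
```

On this data the smallest bound clips most of the cluster, the largest bound covers the outlier but rounds the cluster coarsely, and an intermediate bound gives a lower MSE than either extreme.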

Symmetric vs. Asymmetric

Symmetric quantization uses a signed range such as INT8 [-127, 127], forces zero_point = 0, and expands the calibration interval to [-a, a], where a = max(abs(min), abs(max)).

Asymmetric quantization uses an unsigned range such as UINT8 [0, 255], computes zero_point from the selected min/max range, and clamps q into [q_min, q_max].
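The two modes differ only in how scale and zero_point are derived from the calibration range. In the sketch below the function names are hypothetical, and the zero-point formula round(q_min - x_min / scale) is a common convention assumed here, since the page does not state how it computes zero_point.

```python
def symmetric_params(x_min: float, x_max: float, bits: int = 8):
    """Signed codes [-q_max, q_max], zero_point fixed at 0."""
    q_max = 2 ** (bits - 1) - 1               # 127 for 8 bits
    q_min = -q_max
    bound = max(abs(x_min), abs(x_max))       # interval becomes [-bound, bound]
    scale = (2 * bound) / (q_max - q_min)
    return scale, 0, q_min, q_max


def asymmetric_params(x_min: float, x_max: float, bits: int = 8):
    """Unsigned codes [0, 2**bits - 1], zero_point derived from x_min."""
    q_min, q_max = 0, 2 ** bits - 1           # 0..255 for 8 bits
    scale = (x_max - x_min) / (q_max - q_min)
    # Assumed convention: the code that x_min maps to is q_min.
    zero_point = round(q_min - x_min / scale)
    zero_point = max(q_min, min(q_max, zero_point))
    return scale, zero_point, q_min, q_max


# A one-sided range, such as activations that are never negative.
print("symmetric: ", symmetric_params(0.0, 6.0))
print("asymmetric:", asymmetric_params(0.0, 6.0))
```

For a one-sided range like [0, 6], the symmetric interval becomes [-6, 6] and roughly half of the codes go unused, so its step is about twice the asymmetric step.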

Hardware Intuition: Memory Traffic

FP32 uses 4 bytes per value. INT8 uses 1 byte per value, so the same memory budget can carry four times as many activation values.

FP32 block: 8 values x 4 bytes = 32 bytes

Fewer values fit in a fixed memory block, so each value consumes more bandwidth.

INT8 block: 32 values x 1 byte = 32 bytes
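The block comparison is simple arithmetic, sketched here for a fixed 32-byte block. The names `BYTES_PER_VALUE` and `values_per_block` are hypothetical, and the calculation ignores any extra storage for scale and zero-point parameters.

```python
BYTES_PER_VALUE = {"FP32": 4, "INT8": 1}


def values_per_block(block_bytes: int, dtype: str) -> int:
    """Number of whole values of the given type that fit in block_bytes."""
    return block_bytes // BYTES_PER_VALUE[dtype]


block_bytes = 32
print(values_per_block(block_bytes, "FP32"))  # 8
print(values_per_block(block_bytes, "INT8"))  # 32
```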