Quantization replaces continuous FP32 activations with a finite set of integer codes.
| Value | q | x_hat | Error |
|---|---|---|---|
Move FP32 activation values into low-precision integer bins, then watch dequantization and error appear as calibration range changes.
Lower precision can reduce memory traffic and improve accelerator throughput when hardware supports it.
The calibration range decides which real values fit inside the integer range.
A narrow range gives a smaller step size but clips values outside the calibrated interval.
A wide range avoids clipping but spreads bins out, increasing rounding error for common values.
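The trade-off between a narrow and a wide calibration range can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes a signed 8-bit scheme with codes in [-127, 127] and a calibration interval of [-abs_max, abs_max]. The function name `quantize_symmetric_int8` and the sample values are hypothetical.

```python
def quantize_symmetric_int8(x, abs_max):
    """Map x to a signed 8-bit code using the interval [-abs_max, abs_max]."""
    scale = abs_max / 127          # real-valued width of one integer bin (step size)
    q = round(x / scale)           # nearest integer code (Python rounds ties to even)
    q = max(-127, min(127, q))     # clip codes that fall outside the integer range
    x_hat = q * scale              # dequantized approximation of x
    return q, x_hat, x_hat - x     # code, reconstruction, signed error


values = [0.02, 0.4, 3.0]          # illustrative activations, not measured data
for abs_max in (1.0, 8.0):         # a narrow and a wide calibration range
    print(f"calibration range [-{abs_max}, {abs_max}]")
    for x in values:
        q, x_hat, err = quantize_symmetric_int8(x, abs_max)
        print(f"  x={x:<5} q={q:<4} x_hat={x_hat:.4f} error={err:+.4f}")
```

With the narrow range, 3.0 is clipped to code 127 and reconstructed as roughly 1.0, so nearly all of its magnitude is lost. With the wide range, 3.0 survives, but the step grows to 8/127 and the small value 0.02 rounds to code 0.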
Symmetric quantization uses a signed range such as INT8 [-127, 127], forces zero_point = 0, and expands the calibration interval to [-a, a], where a = max(abs(min), abs(max)).
Asymmetric quantization uses an unsigned range such as UINT8 [0, 255], computes zero_point from the selected min/max range, and clamps q into [q_min, q_max].
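The two schemes differ mainly in how they derive the scale and zero_point from the calibration range. The sketch below shows one common way to do this, under stated assumptions: the helper names (`symmetric_params`, `asymmetric_params`, `quantize`, `dequantize`) are hypothetical, the range [-1.0, 3.0] is illustrative, and the calibration range is assumed to be non-degenerate (max > min) and to contain 0.0.

```python
def symmetric_params(r_min, r_max, q_max=127):
    """Signed range [-q_max, q_max]; zero_point is fixed at 0."""
    abs_max = max(abs(r_min), abs(r_max))    # interval becomes [-abs_max, abs_max]
    return abs_max / q_max, 0


def asymmetric_params(r_min, r_max, q_min=0, q_max=255):
    """Unsigned range [q_min, q_max]; zero_point is the code that represents 0.0."""
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = round(q_min - r_min / scale)
    zero_point = max(q_min, min(q_max, zero_point))   # keep it a valid code
    return scale, zero_point


def quantize(x, scale, zero_point, q_min, q_max):
    q = round(x / scale) + zero_point
    return max(q_min, min(q_max, q))         # clamp q into [q_min, q_max]


def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale


r_min, r_max = -1.0, 3.0                     # illustrative calibration range
s_scale, s_zp = symmetric_params(r_min, r_max)
a_scale, a_zp = asymmetric_params(r_min, r_max)
print("symmetric :", s_scale, s_zp)
print("asymmetric:", a_scale, a_zp)

x = 2.5
schemes = [("symmetric", s_scale, s_zp, -127, 127),
           ("asymmetric", a_scale, a_zp, 0, 255)]
for name, scale, zp, lo, hi in schemes:
    q = quantize(x, scale, zp, lo, hi)
    print(name, q, dequantize(q, scale, zp))
```

For this skewed range, the symmetric step is 3/127 (about 0.0236), while the asymmetric step is 4/255 (about 0.0157). The symmetric scheme spends codes on [-3, -1), which the calibrated data never reaches.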
FP32 uses 4 bytes per value. INT8 uses 1 byte per value, so the same memory budget can carry four times as many activation values.
At FP32, fewer values fit in a fixed memory block, so each value consumes more bandwidth.
At INT8, more values fit in the same memory footprint, although real speedup still depends on hardware support, memory bandwidth, and kernel implementation.
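The four-to-one ratio is simple arithmetic, which a short check can make concrete. The sketch below uses Python's standard `array` module as a stand-in for real tensor storage. The 64 KiB budget is an arbitrary illustrative figure, and the `"f"` typecode is assumed to map to a 4-byte C float, as it does on common platforms.

```python
from array import array

BUDGET_BYTES = 64 * 1024                 # illustrative 64 KiB memory budget

fp32 = array("f")                        # C float: 4 bytes on common platforms
int8 = array("b")                        # signed char: 1 byte

fp32_count = BUDGET_BYTES // fp32.itemsize   # values that fit at FP32
int8_count = BUDGET_BYTES // int8.itemsize   # values that fit at INT8

print("FP32:", fp32.itemsize, "byte(s) per value,", fp32_count, "values")
print("INT8:", int8.itemsize, "byte(s) per value,", int8_count, "values")
print("ratio:", int8_count / fp32_count)
```

This counts storage only. It says nothing about whether a given kernel or accelerator actually runs faster on the smaller type.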