Step through how FP32 weights are compressed to low-bit integers — and see exactly where precision is lost.
Original 32-bit float weights.
Appears after Step 3 (dequantize). Bar height = |w − w̃|.
Symmetric maps [−max, max] with zero-point = 0. Asymmetric adds an offset, useful for ReLU activations that are always ≥ 0.
Per-tensor uses one scale for all weights. Per-row assigns one scale per output channel, capturing different magnitudes at a small cost.
One large outlier forces a wide scale, wasting most of the INT8 range. SmoothQuant and GPTQ redistribute outliers before quantizing.