Interactive visualizations of core quantization concepts: quantization-aware training (QAT), inference activation strategies, and modern weight-only methods for LLMs.
Straight-Through Estimator (STE) in QAT
During training, we need gradients to update the FP32 weights. But the quantizer's rounding step function has a gradient of zero almost everywhere (and is undefined at the step boundaries), so nothing would ever learn. The Straight-Through Estimator (STE) sidesteps this by routing the gradient straight through the quantizer as if it were the identity: dL/dr = dL/dQ.
How to use this simulator: click the "Apply Gradients" button at the bottom to run one step of a real training loop. The FP32 weights update continuously, and the quantized integers occasionally flash as they snap to new steps.
[Interactive diagram: the FP32 weight r passes through the Quantizer to become the integer weight Q (forward pass). In the backward pass, the gradient dL/dQ flows through the STE as an identity to become dL/dr, which is added back into r. An epoch counter tracks the training steps.]
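As a concrete sketch of what the simulator does, here is a minimal STE fake-quantizer in PyTorch. This is an illustrative sketch, not the simulator's code: the class name STEFakeQuantize, the fixed scale of 1.0, and the toy squared loss are all assumptions.

import torch

class STEFakeQuantize(torch.autograd.Function):
    @staticmethod
    def forward(ctx, r, scale):
        # Forward pass: snap the FP32 weight to its nearest quantization step.
        return torch.round(r / scale) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Backward pass: treat the quantizer as the identity, so dL/dr = dL/dQ.
        # None is returned because the fixed scale gets no gradient here.
        return grad_output, None

# One "Apply Gradients" step on a tiny weight vector (values chosen so the
# quantized values match the diagram: 1, 2, -2, 2).
r = torch.tensor([0.9, 2.1, -1.7, 1.6], requires_grad=True)
q = STEFakeQuantize.apply(r, 1.0)
loss = (q ** 2).sum()   # stand-in loss
loss.backward()         # r.grad now holds dL/dr, routed through the STE
print(q.detach(), r.grad)

Repeated steps shift r continuously, while q only changes when r crosses a step boundary: exactly the occasional "flash" the simulator shows.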
Inference Activation Strategies
Weights are always quantized statically, offline. Activations, however, can use quantization parameters computed before deployment on calibration data (static quantization) or computed on the fly for each incoming batch (dynamic quantization).
[Interactive plot: the 1D distribution of an incoming activation batch, with r_min and r_max range markers. Two buttons run the strategies: Static Quantization (fast; uses a range fixed before deployment) and Dynamic Quantization (slower; recomputes the range per batch for a clean result). The result readout shows how well each range fits the batch.]
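A minimal sketch of the two strategies, assuming symmetric INT8 quantization; the calibrated scale value 0.05 is a made-up placeholder, not something the demo prescribes.

import torch

def quantize_int8(x, scale):
    # Symmetric quantization to signed 8-bit integers.
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)

# Static: the scale was fixed before deployment from calibration data, so
# quantization is one multiply-and-round per batch (fast), but a batch that
# exceeds the calibrated range will clip.
STATIC_SCALE = 0.05  # hypothetical value found during calibration

def static_quant(batch):
    return quantize_int8(batch, STATIC_SCALE)

# Dynamic: r_min/r_max are measured on each incoming batch (slower, needs an
# extra reduction pass), so the range always fits the data.
def dynamic_quant(batch):
    scale = batch.abs().max() / 127.0
    return quantize_int8(batch, scale), scale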
Weight-Only & Grouped Quantization (LLMs)
Data Flow Architecture
[Diagram: Weights (INT8) ⇨ Dequantize ⇨ FP16 math.]
Weight-Only (W8A16): weights are stored as INT8 for fast memory access and a small footprint, then dequantized to FP16 so the matrix math keeps high precision.
Grouped (Block) Quantization
Instead of one scale for an entire tensor or channel, the weights are split into small blocks (e.g., 128 values each), and each block gets its own scale, so a single outlier only degrades its own block. See the sketch below.
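The sketch below combines both ideas: INT8 weights with one scale per block of 128 input values, dequantized to FP16 just before the matmul. The function names and the group size are illustrative assumptions.

import torch

GROUP_SIZE = 128  # a common block size; smaller groups cost more scale storage

def quantize_grouped(w, group_size=GROUP_SIZE):
    # Split each output row into blocks and give every block its own scale.
    out_f, in_f = w.shape
    blocks = w.reshape(out_f, in_f // group_size, group_size)
    scales = blocks.abs().amax(dim=-1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(blocks / scales.clamp(min=1e-8)), -128, 127)
    return q.to(torch.int8), scales.to(torch.float16)

def w8a16_linear(x_fp16, q, scales):
    # W8A16: load compact INT8 weights, dequantize, do the math in FP16.
    w_fp16 = (q.to(torch.float16) * scales).reshape(q.shape[0], -1)
    return x_fp16 @ w_fp16.t()

Because every block carries its own scale, one outlier weight only widens the bins inside its own 128-value block rather than across the whole tensor.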
Granularity & Calibration Strategies
Optimizing precision by picking better ranges (Calibration) and better scale buckets (Granularity).
Calibration (Picking the Range)
[Plot: a value distribution on an axis from -10.0 to +10.0, with the calibrated quantization range overlaid.]
Min-Max Calibration: Range stretches to the extreme outliers. Bins are large, so precision is extremely low for the main data distribution.
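In code, the calibration choice is simply the choice of how the scale is derived from observed data. A sketch for symmetric INT8; the percentile variant is a commonly used alternative to min-max, shown here for contrast (the 99.9th percentile is an illustrative choice):

import torch

def minmax_scale(x):
    # Min-max: the range stretches to the single most extreme outlier,
    # so every bin is |x|.max() / 127 wide.
    return x.abs().max() / 127.0

def percentile_scale(x, pct=0.999):
    # Percentile clipping: ignore the top 0.1% of magnitudes, keeping
    # the bins small for the bulk of the distribution.
    return torch.quantile(x.abs().flatten().float(), pct) / 127.0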
Granularity (Bucketing)
[Diagram: Channel 1 (scale: 0.1), Channel 2 (scale: 0.1), Channel 3 (scale: 0.1). All three channels are forced to share the scale dictated by Channel 3's outlier.]
Per-Tensor: one scale for all channels. Because Channel 3 has a huge outlier, every channel must use the same wide, low-precision scale, and the small-magnitude channels lose nearly all resolution.
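To make the failure concrete, here is a sketch comparing per-tensor and per-channel scales on a tiny three-channel weight mirroring the diagram (the actual values are invented):

import torch

w = torch.tensor([[0.08, -0.05,  0.06],   # Channel 1: small values
                  [0.04,  0.07, -0.03],   # Channel 2: small values
                  [0.02, 12.70, -0.05]])  # Channel 3: one huge outlier

# Per-tensor: a single scale, dictated by Channel 3's outlier.
scale_t = w.abs().max() / 127.0                      # = 0.1, wide bins for all
q_t = torch.round(w / scale_t)

# Per-channel: each row gets its own scale; the outlier stays contained.
scale_c = w.abs().amax(dim=1, keepdim=True) / 127.0
q_c = torch.round(w / scale_c)

print(q_t[0])   # Channel 1 collapses onto a handful of integer levels
print(q_c[0])   # with its own scale, Channel 1 uses the full INT8 range

Per-channel scales (and, finer still, the per-group scales above) are the standard fix: the outlier channel keeps its wide range without forcing it on everyone else.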